Python and NLP Notes
Python offers numerous tools for handling and manipulating text. The most basic of these are
string operations, but for more advanced tasks, we use libraries like re for regular expressions
and spaCy for NLP-specific functions.
String operations in Python are efficient for simple text processing tasks, such as breaking a
sentence into words, converting text to lowercase, or replacing substrings.
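A small sketch of these basic string operations (the sample sentence is arbitrary):
text = "NLP makes Machines understand Text."
words = text.split()                          # break the sentence into words
lower = text.lower()                          # convert to lowercase
replaced = text.replace("Text", "Language")   # replace a substring
print(words, lower, replaced, sep="\n")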
Operation | Description | Example Code | Output
Reading Lines | Reads the file line by line and returns a list. | file.readlines() | List of lines in file
Example:
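A minimal sketch (the file name notes.txt is purely illustrative):
with open("notes.txt") as file:   # hypothetical file name
    lines = file.readlines()      # returns a list of lines
print(lines)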
Python libraries such as PyPDF2 and pdfminer.six are commonly used to extract text from
PDF documents.
Example:
import PyPDF2
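Continuing from the import above, a minimal sketch using the PyPDF2 3.x PdfReader API (sample.pdf is a hypothetical file name):
reader = PyPDF2.PdfReader("sample.pdf")   # hypothetical file name
text = ""
for page in reader.pages:
    text += page.extract_text()           # extract the text of each page
print(text)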
Regular expressions allow us to define complex patterns to search, match, or manipulate text.
Python’s re module provides a variety of functions to work with regex.
Regex Operation | Description | Example Code | Output
import re
text = "Hello! Welcome to NLP 101."
clean_text = re.sub(r'[^A-Za-z\s]', '', text)  # Removes anything that is not a letter or space
print(clean_text)  # Output: "Hello Welcome to NLP"
Preprocessing is the crucial first step in any NLP pipeline, ensuring that the data is cleaned and
normalized before being fed into algorithms.
Preprocessing Task | Regex Pattern / Operation | Example Output
NLP involves enabling machines to understand, interpret, and generate human language. It
combines computer science, linguistics, and machine learning techniques.
Challenges in NLP
Challenge | Description | Example
Ambiguity | Words or sentences with multiple meanings. | "The bank is on the river bank." (bank as financial or riverbank)
Variety | Variations in language use across dialects, regions, etc. | British English vs. American English: colour vs. color
Modern NLP relies heavily on machine learning, particularly deep learning, to automatically
detect patterns in language. The following models are popular in NLP:
Model | Description | Example
Bag of Words (BoW) | Text represented as a bag of individual words, ignoring order. | 'I love NLP' → {'I': 1, 'love': 1, 'NLP': 1}
Transformers | Advanced model that captures global context across sentences. | Used in GPT, BERT for tasks like summarization
spaCy is a popular NLP library in Python, known for its efficiency and ease of use. Key features
include tokenization, part-of-speech tagging, and named entity recognition.
Tokenization
Example: Tokenization
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Natural Language Processing is exciting!")
tokens = [token.text for token in doc]
print(tokens)  # Output: ['Natural', 'Language', 'Processing', 'is', 'exciting', '!']
Example: Lemmatization
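A minimal lemmatization sketch with spaCy (the sentence is arbitrary; the lemmas in the comment are typical, not guaranteed, outputs):
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("The children were running home")
print([(token.text, token.lemma_) for token in doc])
# typically: [('The', 'the'), ('children', 'child'), ('were', 'be'), ('running', 'run'), ('home', 'home')]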
Phrase matching is used to search for multi-word expressions in text, which are often significant
in NLP tasks like entity recognition or keyword extraction.
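A minimal phrase-matching sketch with spaCy's PhraseMatcher (spaCy v3 API; the phrases and sentence are arbitrary):
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")   # match case-insensitively
patterns = [nlp.make_doc(p) for p in ["natural language processing", "machine learning"]]
matcher.add("TECH_TERMS", patterns)

doc = nlp("I enjoy Natural Language Processing and machine learning.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)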
POS Tagging is the process of labeling each word in a sentence with its respective part of
speech, such as noun, verb, adjective, etc. POS tagging is a fundamental part of many NLP
tasks, including syntactic parsing and word-sense disambiguation.
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("She runs through the city.")
print([(t.text, t.pos_) for t in doc])
Example Output:
[('She', 'PRON'), ('runs', 'VERB'), ('through', 'ADP'), ('the', 'DET'), ('city', 'NOUN'), ('.', 'PUNCT')]
Named Entity Recognition (NER) identifies and classifies named entities mentioned in the text into predefined categories, such as persons, organizations, locations, dates, etc.
POS Tagging vs. NER:
Aspect | POS Tagging | NER
Purpose | Labels words as nouns, verbs, etc. | Identifies proper nouns and classifies them (e.g., person, organization)
Example | Verb (run), Noun (city) | Person (Elon Musk), Organization (Apple), GPE (U.K.)
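A minimal NER sketch with spaCy (the sentence is arbitrary; the labels in the comment are typical, not guaranteed, outputs of the small model):
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Elon Musk founded SpaceX in the U.S. in 2002.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# typically: Elon Musk PERSON, SpaceX ORG, U.S. GPE, 2002 DATE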
Sentence segmentation is the process of splitting text into individual sentences. It is a critical
step in NLP for understanding sentence boundaries and structure.
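A minimal spaCy sketch that produces the example output below (sentence boundaries come from the model's parser):
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Hello! How are you? I'm doing well.")
for sent in doc.sents:
    print(sent.text)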
Example Output:
Hello!
How are you?
I'm doing well.
The Bag of Words (BoW) model represents text data as a collection of words, ignoring grammar
and word order but maintaining frequency counts of each word.
Document | I | love | NLP | is | amazing | programming
I love NLP | 1 | 1 | 1 | 0 | 0 | 0
NLP is amazing | 0 | 0 | 1 | 1 | 1 | 0
I love programming | 1 | 1 | 0 | 0 | 0 | 1
Advantages | Limitations
Works well for simple text classification tasks | Does not capture semantic meaning of words
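A sketch reproducing counts like the table above with scikit-learn's CountVectorizer (token_pattern is adjusted because the default drops single-character tokens such as "I"; column order in the output is alphabetical):
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love NLP", "NLP is amazing", "I love programming"]
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")  # keep one-letter tokens
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())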
2.5 Text Modeling using the TF-IDF Model
TF-IDF Formula:
TF-IDF(t, d) = TF(t, d) × IDF(t), where TF(t, d) is how often term t occurs in document d and IDF(t) = log(N / df(t)), with N the total number of documents and df(t) the number of documents containing t.
TF-IDF | Weighs words by frequency and importance in the corpus. | Better for tasks where word significance matters, like information retrieval.
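A minimal TF-IDF sketch with scikit-learn's TfidfVectorizer on the same toy corpus:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I love NLP", "NLP is amazing", "I love programming"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))   # each row is a document's TF-IDF weights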
An N-Gram is a contiguous sequence of n items (words, characters, etc.) from a given text.
N-Grams capture local context by analyzing adjacent words or characters.
Types of N-Grams
● Unigram (n = 1): single words, e.g., "love"
● Bigram (n = 2): pairs of adjacent words, e.g., "love NLP"
● Trigram (n = 3): triples of adjacent words, e.g., "I love NLP"
N-Gram Applications
● Language modeling and next-word prediction (autocomplete)
● Spelling correction and text generation
● Feature extraction for text classification
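A small sketch generating bigrams and trigrams with NLTK (the sentence is arbitrary):
from nltk.util import ngrams

tokens = "I love natural language processing".split()
print(list(ngrams(tokens, 2)))  # bigrams: [('I', 'love'), ('love', 'natural'), ...]
print(list(ngrams(tokens, 3)))  # trigrams: [('I', 'love', 'natural'), ...]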
Latent Semantic Analysis (LSA) is a technique in natural language processing that helps
discover the underlying structure of relationships between terms and documents. LSA reduces
the dimensionality of text data by transforming it into a lower-dimensional space using Singular
Value Decomposition (SVD). This technique is useful for text clustering, topic modeling, and
document similarity.
Steps in LSA:
1. Build a term-document matrix (for example, with TF-IDF weights).
2. Apply Singular Value Decomposition (SVD) to the matrix.
3. Keep only the top k singular values to obtain a lower-dimensional representation of terms and documents.
Formula:
A = UΣVᵀ
Where:
● A is the term-document matrix,
● U contains the term vectors,
● Σ is the diagonal matrix of singular values, and
● Vᵀ contains the document vectors.
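A minimal LSA sketch with scikit-learn's TruncatedSVD on a tiny illustrative corpus (2 latent dimensions chosen arbitrarily):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["cats and dogs are pets", "dogs chase cats",
        "stocks and bonds are investments", "investors buy stocks"]
X = TfidfVectorizer().fit_transform(docs)             # term-document matrix A
lsa = TruncatedSVD(n_components=2).fit_transform(X)   # low-rank projection via SVD
print(lsa)  # each row is a document in the 2-dimensional latent space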
LSA Applications:
● Text clustering
● Topic modeling
● Document similarity
In NLP, synonyms are words that have similar meanings, while antonyms are words with
opposite meanings. The NLTK (Natural Language Toolkit) provides a built-in lexical database
called WordNet to fetch synonyms and antonyms for any word.
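A minimal WordNet sketch (assumes the wordnet corpus has been downloaded; the word "good" is used purely as an illustration) that builds the synonym and antonym lists printed below:
import nltk
from nltk.corpus import wordnet

nltk.download('wordnet', quiet=True)

synonyms = []
antonyms = []
for syn in wordnet.synsets("good"):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
        for ant in lemma.antonyms():
            antonyms.append(ant.name())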
print(f"Synonyms: {set(synonyms)}")
print(f"Antonyms: {set(antonyms)}")
Example Output:
● Thesaurus generation.
● Word-sense disambiguation.
● Improving semantic search.
2.9 Word Negation Tracking
Word Negation Tracking refers to identifying and understanding negation in a sentence. Words
like "not", "never", "no", or "none" can drastically change the meaning of a sentence. Handling
negations is crucial for tasks like sentiment analysis or intent recognition.
import nltk
from nltk.tokenize import word_tokenize

def negate_sentence(sentence):
    tokens = word_tokenize(sentence)
    negation = False
    result = []
    for token in tokens:
        if token.lower() in ("not", "never", "no", "none"):
            negation = True   # drop the negation word and mark the next token
        else:
            result.append("NOT_" + token if negation else token)
            negation = False
    return " ".join(result)

# Example sentence
sentence = "I am not happy with the service."
negated_sentence = negate_sentence(sentence)
print(negated_sentence)  # Output: 'I am NOT_happy with the service .'
Text classification is the process of assigning labels or categories to a piece of text based on its
content. This is widely used in tasks like sentiment analysis, spam detection, and topic
classification.
# Sample data
corpus = ["I love this product!", "This is the worst experience!",
"Absolutely fantastic service!", "Terrible customer support."]
labels = [1, 0, 1, 0] # 1=positive, 0=negative
Support Vector Machines | A robust linear classifier that finds the decision boundary. | Fake news detection
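A minimal sketch training a classifier on the sample corpus above (Multinomial Naive Bayes is used here purely for illustration; the test sentence is arbitrary):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)   # corpus and labels defined above
clf = MultinomialNB().fit(X, labels)

test = vectorizer.transform(["What a terrible experience."])
print(clf.predict(test))  # expected: [0] (negative)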
Text summarization is the process of reducing the length of a document while preserving its key
information. There are two main types of text summarization: Extractive Summarization and
Abstractive Summarization.
● Extractive Summarization: Selects key sentences or phrases directly from the original
text.
● Abstractive Summarization: Generates new sentences that summarize the content
(like how humans summarize text).
1. Fetch and Preprocess the Data: Get the document and clean it.
2. Tokenization: Split the text into sentences.
3. Build a Histogram: Calculate the frequency of each word.
4. Calculate Sentence Scores: Score each sentence based on the significance of its
words.
5. Select Sentences: Choose top N sentences for the summary.
# Sample text
text = """Natural Language Processing is an exciting field of
Artificial Intelligence.
It enables machines to understand and process human language. It is
widely used in chatbots, language translation, and many other
applications."""
● Semantics in NLP deals with the meaning and interpretation of words, phrases,
sentences, and larger units of text. It helps understand context, disambiguate word
meanings, and identify relationships between entities.
● Sentiment Analysis focuses on determining the emotional tone behind a body of text,
identifying whether it is positive, negative, or neutral.
Word vectors (also known as word embeddings) are numerical representations of words in a high-dimensional space. Words that share similar contexts in a corpus tend to be closer in this vector space. Word vectors enable semantic analysis by capturing relationships such as similarity (car ↔ truck) and analogy (king − man + woman ≈ queen). Popular word embedding models include:
● Word2Vec: Converts words into dense vectors by training on large corpora using two
architectures: Continuous Bag of Words (CBOW) and Skip-Gram.
● GloVe (Global Vectors for Word Representation): Generates word embeddings by
factorizing word co-occurrence matrices.
● FastText: Builds on Word2Vec, but models sub-word information, making it better for
handling rare words.
Word2Vec Example:
import gensim
from gensim.models import Word2Vec

# Example corpus
sentences = [["I", "love", "NLP"], ["NLP", "is", "fun"], ["I", "enjoy", "learning", "NLP"]]

# Train a small Word2Vec model (gensim 4.x parameter names; tiny settings for this toy corpus)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)
print(model.wv["NLP"][:5])  # first five dimensions of the learned 'NLP' vector
Analogy Example:
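A sketch of an analogy query with gensim's most_similar; meaningful analogies such as king − man + woman ≈ queen require vectors trained on a large corpus (here a pretrained GloVe model from gensim's downloader), not the toy corpus above:
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # pretrained 100-dimensional GloVe vectors
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# 'queen' is expected to rank at or near the top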
Sentiment Analysis is the task of analyzing a piece of text to determine the underlying sentiment or opinion. It can classify text as positive, negative, or neutral. Sentiment analysis is widely used in applications such as product review analysis, social media monitoring, and customer feedback.
Each word is assigned a sentiment score based on its polarity (positive or negative).
● "I love this product!" → Positive sentiment (words like "love" have positive polarity).
● "The service was terrible." → Negative sentiment (words like "terrible" have negative
polarity).
The Natural Language Toolkit (NLTK) provides tools for simple sentiment analysis. The
VADER (Valence Aware Dictionary for Sentiment Reasoning) model, built into NLTK, is
commonly used for lexicon-based sentiment analysis.
# Example sentence
sentence = "I love this product, but the delivery was terrible."
1. Data Collection: Use the IMDb movie review dataset, which contains reviews labeled as
positive or negative.
2. Preprocessing: Clean the text by removing stop words, punctuation, and performing
tokenization.
3. Feature Extraction: Convert text into numerical format using Bag of Words or TF-IDF.
4. Training the Model: Train a machine learning model such as Logistic Regression or
Naive Bayes.
5. Evaluating the Model: Evaluate the model using metrics such as accuracy, precision,
recall, and F1 score.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
Twitter Sentiment Analysis focuses on analyzing the sentiment of tweets in real-time. Since
tweets are brief and often contain informal language, they pose unique challenges for NLP. This
project involves fetching live tweets, processing them, and predicting their sentiment.
1. Set up the Twitter Application: Create a Twitter developer account and get access
tokens and API keys.
2. Fetch Real-Time Tweets: Use the tweepy library to fetch tweets based on specific
hashtags or keywords.
3. Preprocessing the Tweets: Clean tweets by removing URLs, hashtags, mentions, and
special characters.
4. Predicting Sentiment: Load a pre-trained sentiment analysis model (e.g., TF-IDF and
Logistic Regression) to classify the sentiment of each tweet.
5. Visualizing Results: Plot the distribution of sentiments (positive, negative, neutral).
import tweepy
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Set up Tweepy API with your credentials
auth = tweepy.OAuthHandler('API_KEY', 'API_SECRET_KEY')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
api = tweepy.API(auth)
Once the tweets are processed and classified, you can plot the sentiment distribution using
matplotlib or seaborn to visualize whether the majority of tweets are positive, negative, or
neutral.
Objective: The goal of this project is to classify movie reviews as either positive or negative
using machine learning techniques. In this section, we'll break down the project pipeline into
detailed steps with relevant code, insights, and explanations.
1. Data Collection:
○ We'll use the IMDb movie review dataset, a commonly used dataset for
sentiment analysis.
○ This dataset contains reviews labeled as positive or negative, which helps train
a classification model.
import pandas as pd
# Load IMDb dataset (CSV file with reviews and sentiment labels)
data = pd.read_csv('IMDB_Dataset.csv')
"This was the worst movie I have ever seen. I regret watching it." Negative
"An absolute masterpiece with brilliant performances by the entire cast." Positive
"Terrible plot, bad acting, and a complete waste of time. Avoid this movie Negative
at all costs."
3. Preprocessing:
○ Before training the model, we need to clean the data:
■ Lowercasing: Convert all text to lowercase to ensure uniformity.
■ Removing Punctuation: Strip out punctuation marks that don’t carry
meaning.
■ Tokenization: Split text into individual words or tokens.
■ Stop Word Removal: Remove common words like "the", "is", "in", which
don’t contribute to sentiment.
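A sketch of the preprocessing and feature-extraction steps; the column names 'review', 'clean_review', and 'sentiment', and the 'positive'/'negative' label values, are assumptions about the CSV layout:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
stop_words = set(stopwords.words('english'))

def preprocess(text):
    text = text.lower()                    # lowercasing
    text = re.sub(r'[^a-z\s]', '', text)   # remove punctuation and digits
    tokens = word_tokenize(text)           # tokenization
    return ' '.join(w for w in tokens if w not in stop_words)  # stop word removal

# assumed columns: 'review' (text) and 'sentiment' ('positive'/'negative')
data['clean_review'] = data['review'].apply(preprocess)

vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(data['clean_review'])
y = data['sentiment'].map({'positive': 1, 'negative': 0})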
# Split the data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)
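A sketch of the training and evaluation steps using Logistic Regression, as in the earlier outline; the accuracy and report shown below are the figures from the original notes, not guaranteed outputs of this sketch:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", round(accuracy_score(y_test, y_pred), 2))
print(classification_report(y_test, y_pred))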
Accuracy: 0.87
precision recall f1-score support
Objective: Analyze the sentiment of real-time tweets on a specific topic or hashtag using the
Twitter API and classify them as positive, negative, or neutral.
import tweepy
# Replace these with your own API keys from Twitter Developer account
API_KEY = 'your_api_key'
API_SECRET = 'your_api_secret'
ACCESS_TOKEN = 'your_access_token'
ACCESS_SECRET = 'your_access_secret'
2. Authenticating with the API:
○ Use the keys above to create an authenticated tweepy API object, as sketched below.
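A minimal sketch reusing the keys defined above (the same tweepy OAuth flow shown earlier in these notes):
import tweepy

auth = tweepy.OAuthHandler(API_KEY, API_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
api = tweepy.API(auth)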
3. Fetching Real-Time Tweets:
○ We can fetch tweets based on specific hashtags or keywords.
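A sketch of fetching tweets with tweepy's Cursor; assumes tweepy 4.x (where the search endpoint is api.search_tweets) and uses '#NLP' as an arbitrary example query:
# Fetch up to 100 recent English tweets matching the query
tweets = [status for status in tweepy.Cursor(api.search_tweets, q="#NLP", lang="en").items(100)]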
4. Preprocessing Tweets:
○ Just like the movie review dataset, we preprocess tweets to remove unnecessary
characters such as URLs, hashtags, mentions, and punctuation.
import re

def preprocess_tweet(text):
    text = re.sub(r"http\S+", "", text)   # Remove URLs
    text = re.sub(r"#\w+", "", text)      # Remove hashtags
    text = re.sub(r"@\w+", "", text)      # Remove mentions
    text = re.sub(r'[^\w\s]', '', text)   # Remove punctuation
    return text.lower()
# Preprocess tweets
cleaned_tweets = [preprocess_tweet(tweet.text) for tweet in tweets]
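A sketch of the remaining steps, classifying the cleaned tweets with NLTK's VADER and plotting the distribution with matplotlib; the ±0.05 compound-score thresholds are the commonly used defaults, and the chart details are illustrative:
from collections import Counter
import matplotlib.pyplot as plt
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon', quiet=True)
sia = SentimentIntensityAnalyzer()

def label(text):
    compound = sia.polarity_scores(text)["compound"]
    if compound > 0.05:
        return "positive"
    if compound < -0.05:
        return "negative"
    return "neutral"

# Count sentiment labels for the cleaned tweets and plot the distribution
counts = Counter(label(t) for t in cleaned_tweets)
plt.bar(counts.keys(), counts.values())
plt.title("Sentiment distribution of fetched tweets")
plt.ylabel("Number of tweets")
plt.show()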