Python and NLP Notes
Python offers numerous tools for handling and manipulating text. The most basic of these are
string operations, but for more advanced tasks, we use libraries like re for regular expressions
and spaCy for NLP-specific functions.
String operations in Python are efficient for simple text processing tasks, such as breaking a
sentence into words, converting text to lowercase, or replacing substrings.
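A small sketch of these basic string operations (the sample sentence is arbitrary):
text = "NLP makes Machines understand Text."
words = text.split()                          # break the sentence into words
lower = text.lower()                          # convert to lowercase
replaced = text.replace("Text", "Language")   # replace a substring
print(words, lower, replaced, sep="\n")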
Operation | Description | Example Code | Output
Reading Lines | Reads the file line by line and returns a list. | file.readlines() | List of lines in file
Example:
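A minimal sketch (the file name notes.txt is purely illustrative):
with open("notes.txt") as file:   # hypothetical file name
    lines = file.readlines()      # returns a list of lines
print(lines)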
Python libraries such as PyPDF2 and pdfminer.six are commonly used to extract text from
PDF documents.
Example:
import PyPDF2
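Continuing from the import above, a minimal sketch using the PyPDF2 3.x PdfReader API (sample.pdf is a hypothetical file name):
reader = PyPDF2.PdfReader("sample.pdf")   # hypothetical file name
text = ""
for page in reader.pages:
    text += page.extract_text()           # extract the text of each page
print(text)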
Regular expressions allow us to define complex patterns to search, match, or manipulate text.
Python’s re module provides a variety of functions to work with regex.
Regex Operation | Description | Example Code | Output
import re
text = "Hello! Welcome to NLP 101."
clean_text = re.sub(r'[^A-Za-z\s]', '', text)  # Removes anything that is not a letter or space
print(clean_text)  # Output: "Hello Welcome to NLP"
Preprocessing is the crucial first step in any NLP pipeline, ensuring that the data is cleaned and
normalized before being fed into algorithms.
Preprocessing Task | Regex Pattern / Operation | Example Output
NLP involves enabling machines to understand, interpret, and generate human language. It
combines computer science, linguistics, and machine learning techniques.
Challenges in NLP
Challenge | Description | Example
Ambiguity | Words or sentences with multiple meanings. | "The bank is on the river bank." (bank as financial or riverbank)
Variety | Variations in language use across dialects, regions, etc. | British English vs. American English: colour vs. color
Modern NLP relies heavily on machine learning, particularly deep learning, to automatically
detect patterns in language. The following models are popular in NLP:
Model | Description | Example
Bag of Words (BoW) | Text represented as a bag of individual words, ignoring order. | 'I love NLP' → {'I': 1, 'love': 1, 'NLP': 1}
Transformers | Advanced model that captures global context across sentences. | Used in GPT, BERT for tasks like summarization
spaCy is a popular NLP library in Python, known for its efficiency and ease of use. Key features
include tokenization, part-of-speech tagging, and named entity recognition.
Tokenization
Example: Tokenization
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Natural Language Processing is exciting!")
tokens = [token.text for token in doc]
print(tokens)  # Output: ['Natural', 'Language', 'Processing', 'is', 'exciting', '!']
Example: Lemmatization
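A minimal lemmatization sketch with spaCy (the sentence is arbitrary; the lemmas in the comment are typical, not guaranteed, outputs):
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("The children were running home")
print([(token.text, token.lemma_) for token in doc])
# typically: [('The', 'the'), ('children', 'child'), ('were', 'be'), ('running', 'run'), ('home', 'home')]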
Phrase matching is used to search for multi-word expressions in text, which are often significant
in NLP tasks like entity recognition or keyword extraction.
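A minimal phrase-matching sketch with spaCy's PhraseMatcher (spaCy v3 API; the phrases and sentence are arbitrary):
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")   # match case-insensitively
patterns = [nlp.make_doc(p) for p in ["natural language processing", "machine learning"]]
matcher.add("TECH_TERMS", patterns)

doc = nlp("I enjoy Natural Language Processing and machine learning.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)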
POS Tagging is the process of labeling each word in a sentence with its respective part of
speech, such as noun, verb, adjective, etc. POS tagging is a fundamental part of many NLP
tasks, including syntactic parsing and word-sense disambiguation.
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("She runs through the city.")
print([(t.text, t.pos_) for t in doc])
Example Output:
[('She', 'PRON'), ('runs', 'VERB'), ('through', 'ADP'), ('the', 'DET'), ('city', 'NOUN'), ('.', 'PUNCT')]
Named Entity Recognition (NER) identifies and classifies named entities mentioned in the text into predefined categories, such as persons, organizations, locations, dates, etc.
POS Tagging vs. NER:
Aspect | POS Tagging | NER
Purpose | Labels words as nouns, verbs, etc. | Identifies proper nouns and classifies them (e.g., person, organization)
Example | Verb (run), Noun (city) | Person (Elon Musk), Organization (Apple), GPE (U.K.)
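A minimal NER sketch with spaCy (the sentence is arbitrary; the labels in the comment are typical, not guaranteed, outputs of the small model):
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Elon Musk founded SpaceX in the U.S. in 2002.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# typically: Elon Musk PERSON, SpaceX ORG, U.S. GPE, 2002 DATE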
Sentence segmentation is the process of splitting text into individual sentences. It is a critical
step in NLP for understanding sentence boundaries and structure.
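A minimal spaCy sketch that produces the example output below (sentence boundaries come from the model's parser):
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Hello! How are you? I'm doing well.")
for sent in doc.sents:
    print(sent.text)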
Example Output:
Hello!
How are you?
I'm doing well.
The Bag of Words (BoW) model represents text data as a collection of words, ignoring grammar
and word order but maintaining frequency counts of each word.
Document | I | love | NLP | is | amazing | programming
I love NLP | 1 | 1 | 1 | 0 | 0 | 0
NLP is amazing | 0 | 0 | 1 | 1 | 1 | 0
I love programming | 1 | 1 | 0 | 0 | 0 | 1
Advantages | Limitations
Works well for simple text classification tasks | Does not capture semantic meaning of words
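A sketch reproducing counts like the table above with scikit-learn's CountVectorizer (token_pattern is adjusted because the default drops single-character tokens such as "I"; column order in the output is alphabetical):
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love NLP", "NLP is amazing", "I love programming"]
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")  # keep one-letter tokens
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())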
2.5 Text Modeling using the TF-IDF Model
TF-IDF Formula:
TF-IDF(t, d) = TF(t, d) × IDF(t), where TF(t, d) is how often term t occurs in document d and IDF(t) = log(N / df(t)), with N the total number of documents and df(t) the number of documents containing t.
TF-IDF | Weighs words by frequency and importance in the corpus. | Better for tasks where word significance matters, like information retrieval.
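A minimal TF-IDF sketch with scikit-learn's TfidfVectorizer on the same toy corpus:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I love NLP", "NLP is amazing", "I love programming"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))   # each row is a document's TF-IDF weights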
An N-Gram is a contiguous sequence of n items (words, characters, etc.) from a given text.
N-Grams capture local context by analyzing adjacent words or characters.
Types of N-Grams
● Unigram (n = 1): single words, e.g., "love"
● Bigram (n = 2): pairs of adjacent words, e.g., "love NLP"
● Trigram (n = 3): triples of adjacent words, e.g., "I love NLP"
N-Gram Applications
● Language modeling and next-word prediction (autocomplete)
● Spelling correction and text generation
● Feature extraction for text classification
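A small sketch generating bigrams and trigrams with NLTK (the sentence is arbitrary):
from nltk.util import ngrams

tokens = "I love natural language processing".split()
print(list(ngrams(tokens, 2)))  # bigrams: [('I', 'love'), ('love', 'natural'), ...]
print(list(ngrams(tokens, 3)))  # trigrams: [('I', 'love', 'natural'), ...]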
Latent Semantic Analysis (LSA) is a technique in natural language processing that helps
discover the underlying structure of relationships between terms and documents. LSA reduces
the dimensionality of text data by transforming it into a lower-dimensional space using Singular
Value Decomposition (SVD). This technique is useful for text clustering, topic modeling, and
document similarity.
Steps in LSA:
1. Build a term-document matrix (for example, with TF-IDF weights).
2. Apply Singular Value Decomposition (SVD) to the matrix.
3. Keep only the top k singular values to obtain a lower-dimensional representation of terms and documents.
Formula:
A = UΣVᵀ
Where:
● A is the term-document matrix,
● U contains the term vectors,
● Σ is the diagonal matrix of singular values, and
● Vᵀ contains the document vectors.
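A minimal LSA sketch with scikit-learn's TruncatedSVD on a tiny illustrative corpus (2 latent dimensions chosen arbitrarily):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["cats and dogs are pets", "dogs chase cats",
        "stocks and bonds are investments", "investors buy stocks"]
X = TfidfVectorizer().fit_transform(docs)             # term-document matrix A
lsa = TruncatedSVD(n_components=2).fit_transform(X)   # low-rank projection via SVD
print(lsa)  # each row is a document in the 2-dimensional latent space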
LSA Applications:
● Text clustering
● Topic modeling
● Document similarity
In NLP, synonyms are words that have similar meanings, while antonyms are words with
opposite meanings. The NLTK (Natural Language Toolkit) provides a built-in lexical database
called WordNet to fetch synonyms and antonyms for any word.
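A minimal WordNet sketch (assumes the wordnet corpus has been downloaded; the word "good" is used purely as an illustration) that builds the synonym and antonym lists printed below:
import nltk
from nltk.corpus import wordnet

nltk.download('wordnet', quiet=True)

synonyms = []
antonyms = []
for syn in wordnet.synsets("good"):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
        for ant in lemma.antonyms():
            antonyms.append(ant.name())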
print(f"Synonyms: {set(synonyms)}")
print(f"Antonyms: {set(antonyms)}")
Example Output:
● Thesaurus generation.
● Word-sense disambiguation.
● Improving semantic search.
2.9 Word Negation Tracking
Word Negation Tracking refers to identifying and understanding negation in a sentence. Words
like "not", "never", "no", or "none" can drastically change the meaning of a sentence. Handling
negations is crucial for tasks like sentiment analysis or intent recognition.
import nltk
from nltk.tokenize import word_tokenize

def negate_sentence(sentence):
    tokens = word_tokenize(sentence)
    negation = False
    result = []
    for token in tokens:
        if token.lower() in ("not", "never", "no", "none"):
            negation = True   # drop the negation word and mark the next token
        else:
            result.append("NOT_" + token if negation else token)
            negation = False
    return " ".join(result)

# Example sentence
sentence = "I am not happy with the service."
negated_sentence = negate_sentence(sentence)
print(negated_sentence)  # Output: 'I am NOT_happy with the service .'
Text classification is the process of assigning labels or categories to a piece of text based on its
content. This is widely used in tasks like sentiment analysis, spam detection, and topic
classification.
# Sample data
corpus = ["I love this product!", "This is the worst experience!",
"Absolutely fantastic service!", "Terrible customer support."]
labels = [1, 0, 1, 0] # 1=positive, 0=negative
Support Vector Machines | A robust linear classifier that finds the decision boundary. | Fake news detection
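A minimal sketch training a classifier on the sample corpus above (Multinomial Naive Bayes is used here purely for illustration; the test sentence is arbitrary):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)   # corpus and labels defined above
clf = MultinomialNB().fit(X, labels)

test = vectorizer.transform(["What a terrible experience."])
print(clf.predict(test))  # expected: [0] (negative)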
Text summarization is the process of reducing the length of a document while preserving its key
information. There are two main types of text summarization: Extractive Summarization and
Abstractive Summarization.
● Extractive Summarization: Selects key sentences or phrases directly from the original
text.
● Abstractive Summarization: Generates new sentences that summarize the content
(like how humans summarize text).
1. Fetch and Preprocess the Data: Get the document and clean it.
2. Tokenization: Split the text into sentences.
3. Build a Histogram: Calculate the frequency of each word.
4. Calculate Sentence Scores: Score each sentence based on the significance of its
words.
5. Select Sentences: Choose top N sentences for the summary.
# Sample text
text = """Natural Language Processing is an exciting field of
Artificial Intelligence.
It enables machines to understand and process human language. It is
widely used in chatbots, language translation, and many other
applications."""
● Semantics in NLP deals with the meaning and interpretation of words, phrases,
sentences, and larger units of text. It helps understand context, disambiguate word
meanings, and identify relationships between entities.
● Sentiment Analysis focuses on determining the emotional tone behind a body of text,
identifying whether it is positive, negative, or neutral.
Word vectors (also known as word embeddings) are numerical representations of words in a high-dimensional space. Words that share similar contexts in a corpus tend to be closer in this vector space. Word vectors enable semantic analysis by capturing relationships such as similarity (car ↔ truck) and analogy (king − man + woman ≈ queen). Popular word embedding models include:
● Word2Vec: Converts words into dense vectors by training on large corpora using two
architectures: Continuous Bag of Words (CBOW) and Skip-Gram.
● GloVe (Global Vectors for Word Representation): Generates word embeddings by
factorizing word co-occurrence matrices.
● FastText: Builds on Word2Vec, but models sub-word information, making it better for
handling rare words.
Word2Vec Example:
import gensim
from gensim.models import Word2Vec

# Example corpus
sentences = [["I", "love", "NLP"], ["NLP", "is", "fun"], ["I", "enjoy", "learning", "NLP"]]

# Train a small Word2Vec model (gensim 4.x parameter names; tiny settings for this toy corpus)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)
print(model.wv["NLP"][:5])  # first five dimensions of the learned 'NLP' vector
Analogy Example:
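A sketch of an analogy query with gensim's most_similar; meaningful analogies such as king − man + woman ≈ queen require vectors trained on a large corpus (here a pretrained GloVe model from gensim's downloader), not the toy corpus above:
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # pretrained 100-dimensional GloVe vectors
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# 'queen' is expected to rank at or near the top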
Sentiment Analysis is the task of analyzing a piece of text to determine the underlying sentiment or opinion. It can classify text as positive, negative, or neutral. Sentiment analysis is widely used in applications such as product review analysis, social media monitoring, and customer feedback.
Each word is assigned a sentiment score based on its polarity (positive or negative).
● "I love this product!" → Positive sentiment (words like "love" have positive polarity).
● "The service was terrible." → Negative sentiment (words like "terrible" have negative
polarity).
The Natural Language Toolkit (NLTK) provides tools for simple sentiment analysis. The
VADER (Valence Aware Dictionary for Sentiment Reasoning) model, built into NLTK, is
commonly used for lexicon-based sentiment analysis.
# Example sentence
sentence = "I love this product, but the delivery was terrible."
1. Data Collection: Use the IMDb movie review dataset, which contains reviews labeled as
positive or negative.
2. Preprocessing: Clean the text by removing stop words, punctuation, and performing
tokenization.
3. Feature Extraction: Convert text into numerical format using Bag of Words or TF-IDF.
4. Training the Model: Train a machine learning model such as Logistic Regression or
Naive Bayes.
5. Evaluating the Model: Evaluate the model using metrics such as accuracy, precision,
recall, and F1 score.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
Twitter Sentiment Analysis focuses on analyzing the sentiment of tweets in real-time. Since
tweets are brief and often contain informal language, they pose unique challenges for NLP. This
project involves fetching live tweets, processing them, and predicting their sentiment.
1. Set up the Twitter Application: Create a Twitter developer account and get access
tokens and API keys.
2. Fetch Real-Time Tweets: Use the tweepy library to fetch tweets based on specific
hashtags or keywords.
3. Preprocessing the Tweets: Clean tweets by removing URLs, hashtags, mentions, and
special characters.
4. Predicting Sentiment: Load a pre-trained sentiment analysis model (e.g., TF-IDF and
Logistic Regression) to classify the sentiment of each tweet.
5. Visualizing Results: Plot the distribution of sentiments (positive, negative, neutral).
import tweepy
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Set up Tweepy API with your credentials
auth = tweepy.OAuthHandler('API_KEY', 'API_SECRET_KEY')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
api = tweepy.API(auth)
Once the tweets are processed and classified, you can plot the sentiment distribution using
matplotlib or seaborn to visualize whether the majority of tweets are positive, negative, or
neutral.
Objective: The goal of this project is to classify movie reviews as either positive or negative
using machine learning techniques. In this section, we'll break down the project pipeline into
detailed steps with relevant code, insights, and explanations.
1. Data Collection:
○ We'll use the IMDb movie review dataset, a commonly used dataset for
sentiment analysis.
○ This dataset contains reviews labeled as positive or negative, which helps train
a classification model.
import pandas as pd
# Load IMDb dataset (CSV file with reviews and sentiment labels)
data = pd.read_csv('IMDB_Dataset.csv')
"This was the worst movie I have ever seen. I regret watching it." Negative
"An absolute masterpiece with brilliant performances by the entire cast." Positive
"Terrible plot, bad acting, and a complete waste of time. Avoid this movie Negative
at all costs."
3. Preprocessing:
○ Before training the model, we need to clean the data:
■ Lowercasing: Convert all text to lowercase to ensure uniformity.
■ Removing Punctuation: Strip out punctuation marks that don’t carry
meaning.
■ Tokenization: Split text into individual words or tokens.
■ Stop Word Removal: Remove common words like "the", "is", "in", which
don’t contribute to sentiment.
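A sketch of the preprocessing and feature-extraction steps; the column names 'review', 'clean_review', and 'sentiment', and the 'positive'/'negative' label values, are assumptions about the CSV layout:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
stop_words = set(stopwords.words('english'))

def preprocess(text):
    text = text.lower()                    # lowercasing
    text = re.sub(r'[^a-z\s]', '', text)   # remove punctuation and digits
    tokens = word_tokenize(text)           # tokenization
    return ' '.join(w for w in tokens if w not in stop_words)  # stop word removal

# assumed columns: 'review' (text) and 'sentiment' ('positive'/'negative')
data['clean_review'] = data['review'].apply(preprocess)

vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(data['clean_review'])
y = data['sentiment'].map({'positive': 1, 'negative': 0})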
# Split the data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)
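A sketch of the training and evaluation steps using Logistic Regression, as in the earlier outline; the accuracy and report shown below are the figures from the original notes, not guaranteed outputs of this sketch:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", round(accuracy_score(y_test, y_pred), 2))
print(classification_report(y_test, y_pred))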
Accuracy: 0.87
precision recall f1-score support
Objective: Analyze the sentiment of real-time tweets on a specific topic or hashtag using the
Twitter API and classify them as positive, negative, or neutral.
import tweepy
# Replace these with your own API keys from Twitter Developer account
API_KEY = 'your_api_key'
API_SECRET = 'your_api_secret'
ACCESS_TOKEN = 'your_access_token'
ACCESS_SECRET = 'your_access_secret'
2. Authenticating with the API:
○ Use the keys above to create an authenticated tweepy API object, as sketched below.
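A minimal sketch reusing the keys defined above (the same tweepy OAuth flow shown earlier in these notes):
import tweepy

auth = tweepy.OAuthHandler(API_KEY, API_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
api = tweepy.API(auth)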
3. Fetching Real-Time Tweets:
○ We can fetch tweets based on specific hashtags or keywords.
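A sketch of fetching tweets with tweepy's Cursor; assumes tweepy 4.x (where the search endpoint is api.search_tweets) and uses '#NLP' as an arbitrary example query:
# Fetch up to 100 recent English tweets matching the query
tweets = [status for status in tweepy.Cursor(api.search_tweets, q="#NLP", lang="en").items(100)]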
4. Preprocessing Tweets:
○ Just like the movie review dataset, we preprocess tweets to remove unnecessary
characters such as URLs, hashtags, mentions, and punctuation.
import re

def preprocess_tweet(text):
    text = re.sub(r"http\S+", "", text)   # Remove URLs
    text = re.sub(r"#\w+", "", text)      # Remove hashtags
    text = re.sub(r"@\w+", "", text)      # Remove mentions
    text = re.sub(r'[^\w\s]', '', text)   # Remove punctuation
    return text.lower()
# Preprocess tweets
cleaned_tweets = [preprocess_tweet(tweet.text) for tweet in tweets]
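A sketch of the remaining steps, classifying the cleaned tweets with NLTK's VADER and plotting the distribution with matplotlib; the ±0.05 compound-score thresholds are the commonly used defaults, and the chart details are illustrative:
from collections import Counter
import matplotlib.pyplot as plt
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon', quiet=True)
sia = SentimentIntensityAnalyzer()

def label(text):
    compound = sia.polarity_scores(text)["compound"]
    if compound > 0.05:
        return "positive"
    if compound < -0.05:
        return "negative"
    return "neutral"

# Count sentiment labels for the cleaned tweets and plot the distribution
counts = Counter(label(t) for t in cleaned_tweets)
plt.bar(counts.keys(), counts.values())
plt.title("Sentiment distribution of fetched tweets")
plt.ylabel("Number of tweets")
plt.show()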