
Machine Learning Code Explanation

I am giving you the code for a machine learning project; please explain it to me cell by cell, properly and in detail.

Sure, I'd be happy to help you understand your machine learning code in detail! Please go ahead and
provide the code you'd like me to explain. Once you share it, I can break it down cell by cell, clarifying
what each part does and how it contributes to the overall functionality of your machine learning project.

If you have any specific areas you're particularly curious about or if there are certain cells you find
challenging, feel free to let me know, and I'll focus on those as well.
Looking forward to assisting you!

# %%
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping

# Libraries for NLP preprocessing


import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

from matplotlib import pyplot as plt

# %%
def load_data():
    train_df = pd.read_csv("train.csv")
    test_df = pd.read_csv("test.csv")
    return train_df, test_df

train_df, test_df = load_data()

# %%
train_df

# %%
test_df

# %%
# Display basic information about the datasets
print("Train Dataset Info:")
print(train_df.info())

# %%
print("Test Dataset Info:")
print(test_df.info())

# %%
# Check for missing values in train and test datasets
print("Missing values in Train Dataset:")
print(train_df.isnull().sum())

# Check for missing values in train and test datasets


print("Missing values in Test Dataset:")
print(test_df.isnull().sum())

# %%
import seaborn as sns

plt.figure(figsize=(8, 5))
sns.countplot(data=train_df, x='Browser_Used', order=train_df['Browser_Used'].value_counts().index)
plt.title('Distribution of Browsers Used')
plt.xlabel('Browser')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

# %%
plt.figure(figsize=(8, 5))
sns.countplot(data=train_df, x='Device_Used', order=train_df['Device_Used'].value_counts().index)
plt.title('Distribution of Device Types')
plt.xlabel('Device Type')
plt.ylabel('Count')
plt.show()

# %%
plt.figure(figsize=(8, 5))
sns.countplot(data=train_df, x='Emotion', order=train_df['Emotion'].value_counts().index)
plt.title('Distribution of Emotions')
plt.xlabel('Emotion')
plt.ylabel('Count')
plt.show()

# %%
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Generate word clouds for Happy and Not Happy reviews


def generate_wordclouds(df, emotion_col, text_col):
    # Filter data for Happy and Not Happy sentiments
    happy_text = " ".join(review for review in df[df[emotion_col] == "happy"][text_col].astype(str))
    not_happy_text = " ".join(review for review in df[df[emotion_col] == "not happy"][text_col].astype(str))

    # Generate word clouds
    happy_wordcloud = WordCloud(width=800, height=400, background_color='white').generate(happy_text)
    not_happy_wordcloud = WordCloud(width=800, height=400, background_color='white').generate(not_happy_text)

    # Plot the word clouds
    plt.figure(figsize=(16, 8))

    # Word cloud for Happy
    plt.subplot(1, 2, 1)
    plt.imshow(happy_wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title('Word Cloud for Happy Descriptions')

    # Word cloud for Not Happy
    plt.subplot(1, 2, 2)
    plt.imshow(not_happy_wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title('Word Cloud for Not Happy Descriptions')

    plt.show()

# Generate word clouds


generate_wordclouds(train_df, emotion_col='Emotion', text_col='Description')

# %%
# Pre-compile regex and use a set for stopwords
stop_words = set(stopwords.words('english'))
pattern = re.compile(r"[^a-zA-Z ]")

def clean_text_optimized(text):
    # Remove special characters and numbers
    text = pattern.sub("", text)
    # Convert to lowercase and split
    tokens = text.lower().split()
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    return " ".join(tokens)

# Apply cleaning with list comprehension for faster execution


train_df['cleaned_description'] = [clean_text_optimized(text) for text in train_df['Description']]
test_df['cleaned_description'] = [clean_text_optimized(text) for text in test_df['Description']]

# %%
train_df

# %%
test_df

# %%
# Compute class weights
class_weights = compute_class_weight(
    class_weight='balanced',
    classes=np.unique(train_df['Emotion']),
    y=train_df['Emotion']
)
class_weight_dict = {i: class_weights[i] for i in range(len(class_weights))}
print("Class Weights:", class_weight_dict)

# %%
# Convert Emotion labels to numeric
emotion_mapping = {'happy': 1, 'not happy': 0}
train_labels = train_df['Emotion'].map(emotion_mapping).values

# %%
# Split train data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    train_df['cleaned_description'], train_df['Emotion'], test_size=0.2, random_state=42
)

# %%
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, accuracy_score

# TF-IDF Vectorization
print("Performing TF-IDF Vectorization...")
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train).toarray()
X_val_tfidf = tfidf_vectorizer.transform(X_val).toarray()

# Random Forest Model


print("Training Random Forest...")
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_tfidf, y_train)
rf_predictions = rf_model.predict(X_val_tfidf)

print("Random Forest Classification Report:")


print(classification_report(y_val, rf_predictions, target_names=['not happy', 'happy']))
print("Random Forest Accuracy:", accuracy_score(y_val, rf_predictions))

# %%
# Confusion Matrix
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Generate confusion matrix


cm = confusion_matrix(y_val, rf_predictions)

# Plot confusion matrix


plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['not happy', 'happy'], yticklabels=['not happy', 'happy'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix for Random Forest')
plt.show()

# %%
# Tokenize and Pad Sequences for LSTM
print("\\nPreparing data for LSTM...")
max_words = 5000
max_len = 100
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train)

X_train_sequences = tokenizer.texts_to_sequences(X_train)
X_val_sequences = tokenizer.texts_to_sequences(X_val)

X_train_padded = pad_sequences(X_train_sequences, maxlen=max_len, padding='post')


X_val_padded = pad_sequences(X_val_sequences, maxlen=max_len, padding='post')

# %%
def build_lstm_model(vocab_size, embedding_dim=128, input_length=100):
    model = Sequential()
    model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=input_length))
    model.add(LSTM(128, return_sequences=True))
    model.add(Dropout(0.2))
    model.add(LSTM(64))
    model.add(Dropout(0.2))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))  # Binary classification

    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

    return model

vocab_size = min(len(tokenizer.word_index) + 1, max_words)


model = build_lstm_model(vocab_size, embedding_dim=128, input_length=max_len)
model.summary()

# %%
# Map string labels to numeric values
emotion_mapping = {'happy': 1, 'not happy': 0}
y_train = pd.Series(y_train).map(emotion_mapping) # Ensure it's a Pandas Series
y_val = pd.Series(y_val).map(emotion_mapping)

# Ensure labels are numeric


y_train = np.array(y_train, dtype='float32')
y_val = np.array(y_val, dtype='float32')

# Split data into training and validation sets


X_train, X_val, y_train, y_val = train_test_split(
    X_train_padded, y_train, test_size=0.2, random_state=42
)

# Early stopping to prevent overfitting


early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# Train the model


history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    batch_size=32,
    epochs=20,
    callbacks=[early_stopping]
)

# %%
from sklearn.metrics import classification_report
# Predict on the validation set
y_val_pred_probs = model.predict(X_val)
y_val_preds = (y_val_pred_probs >= 0.5).astype(int) # Convert probabilities to binary predictions

# Generate classification report


print("Classification Report:")
print(classification_report(y_val, y_val_preds, target_names=['not happy', 'happy']))
# %%
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Generate confusion matrix


cm = confusion_matrix(y_val, y_val_preds)

# Plot confusion matrix


plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['not happy', 'happy'], yticklabels=['not happy', 'happy'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix for LSTM Model')
plt.show()

# %%
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# %%
pattern = re.compile(r"[^a-zA-Z ]")
stop_words = set(stopwords.words('english'))

def clean_text(text):
    text = pattern.sub("", text)  # Remove special characters and numbers
    tokens = text.lower().split()  # Convert to lowercase and tokenize
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    return " ".join(tokens)

# %%
def detect_aspects(review):
    aspects = {}
    review_lower = review.lower()
    if any(keyword in review_lower for keyword in ["room", "bed", "suite"]):
        aspects['room'] = review
    if any(keyword in review_lower for keyword in ["service", "staff", "reception"]):
        aspects['service'] = review
    return aspects

# %%
# Function to predict sentiment for each aspect
def predict_aspect_sentiment(review):
    aspects = detect_aspects(review)
    results = {}
    for aspect, text in aspects.items():
        cleaned_text = clean_text(text)
        sequence = tokenizer.texts_to_sequences([cleaned_text])
        padded_sequence = pad_sequences(sequence, maxlen=max_len, padding='post')
        prediction = model.predict(padded_sequence)
        results[aspect] = 'happy' if prediction >= 0.5 else 'not happy'
    return results

# %%
# Function to predict sentiment for multiple reviews
def predict_aspect_sentiments_multiple(reviews):
    results = []  # Store the results for each review
    for review in reviews:
        # Detect aspect-based sentiments
        predicted_sentiments = predict_aspect_sentiment(review)

        # Determine overall sentiment
        overall_sentiment = (
            'happy' if list(predicted_sentiments.values()).count('happy') >
            list(predicted_sentiments.values()).count('not happy') else 'not happy'
        )

        # Append the results for the current review
        results.append({
            'Review': review,
            'Aspect-Based Sentiments': predicted_sentiments,
            'Overall Sentiment': overall_sentiment
        })
    return results

# Example reviews
reviews = [
    "The room was clean and spacious, and the service was good",
    "room was very bad.",
    "The service was very bad and room was also bad."
]

# Predict sentiments for multiple reviews


review_analysis = predict_aspect_sentiments_multiple(reviews)

# Print the results for each review


for analysis in review_analysis:
    print(f"Review: {analysis['Review']}")
    print(f"Aspect-Based Sentiments: {analysis['Aspect-Based Sentiments']}")
    print(f"Overall Sentiment: {analysis['Overall Sentiment']}")
    print("---")

# %%

Certainly! Let's break down your machine learning code cell by cell to understand each part in detail.
This will help you grasp the overall workflow, from data loading and preprocessing to model training
and evaluation.

Cell 1: Importing Libraries


python

# %%
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping

# Libraries for NLP preprocessing


import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

from matplotlib import pyplot as plt

Explanation:

Purpose: This cell imports all the necessary libraries and modules required for data manipulation,
preprocessing, model building, and visualization.

Libraries Imported:
Pandas (`pd`): For data manipulation and analysis.

NumPy (`np`): For numerical operations.

Regular Expressions (`re`): For text preprocessing.


Scikit-learn Modules:
`train_test_split`: To split data into training and validation sets.
`compute_class_weight`: To handle class imbalance by computing class weights.
TensorFlow Keras Modules:
`Tokenizer` and `pad_sequences`: For text tokenization and sequence padding.
`Sequential`, `Embedding`, `LSTM`, `Dense`, `Dropout`: For building the neural network
model.
`to_categorical`: To convert labels to categorical format (one-hot encoding).
`EarlyStopping`: To prevent overfitting by stopping training when validation
performance stops improving.
NLTK Modules:
`stopwords`, `word_tokenize`, `WordNetLemmatizer`:
For natural language processing tasks
like removing stopwords, tokenization, and lemmatization.
Matplotlib (`plt`): For plotting and visualization.
NLTK Downloads:
Downloads necessary NLTK data files (`stopwords`, `punkt`, `wordnet`) required for
preprocessing.

Cell 2: Loading the Data


python

# %%
def load_data():
    train_df = pd.read_csv("train.csv")
    test_df = pd.read_csv("test.csv")
    return train_df, test_df

train_df, test_df = load_data()

Explanation:

Purpose: This cell defines a function to load the training and testing datasets from CSV files and
then calls this function to load the data into `train_df` and `test_df`.

Function `load_data()`:

`pd.read_csv("train.csv")`: Reads the training data from `train.csv` into a Pandas


DataFrame `train_df`.

`pd.read_csv("test.csv")`: Reads the testing data from `test.csv` into a Pandas DataFrame
`test_df`.

Returns: Both DataFrames.

DataFrames:
`train_df`: Contains the training data with features and labels.

`test_df`: Contains the testing data, typically without labels.

Cell 3: Display Training Data

python

# %%
train_df

Explanation:

Purpose: This cell simply displays the `train_df` DataFrame. In a Jupyter notebook, executing this
cell will render the DataFrame as a table, allowing you to inspect the first few rows, columns, and
data types.

Usage: Useful for getting an initial look at the structure and contents of the training dataset.

Cell 4: Display Testing Data


python

# %%
test_df

Explanation:

Purpose: Similar to the previous cell, this displays the `test_df` DataFrame.

Usage: Allows you to inspect the testing dataset's structure and contents, ensuring it has been
loaded correctly.

Cell 5: Basic Information About Training Dataset


python

# %%
# Display basic information about the datasets
print("Train Dataset Info:")
print(train_df.info())

Explanation:

Purpose: Provides a summary of the training dataset, including the number of entries, column
names, non-null counts, and data types.
`train_df.info()`: A Pandas method that prints a concise summary of the DataFrame.

Usage: Helps identify data types, detect missing values, and understand the structure of the
dataset.

Cell 6: Basic Information About Testing Dataset
python

# %%
print("Test Dataset Info:")
print(test_df.info())

Explanation:

Purpose: Similar to the previous cell, this provides a summary of the testing dataset.

Usage: Useful for verifying that the testing data has been loaded correctly and to check for any
inconsistencies or missing data.

Cell 7: Checking for Missing Values


python

# %%
# Check for missing values in train and test datasets
print("Missing values in Train Dataset:")
print(train_df.isnull().sum())

# Check for missing values in train and test datasets


print("Missing values in Test Dataset:")
print(test_df.isnull().sum())

Explanation:

Purpose: Identifies the number of missing (null) values in each column of both the training and
testing datasets.

Methods Used:
`train_df.isnull()`: Returns a DataFrame of the same shape with `True` where values are
`NaN`.

`.sum()`:Aggregates the `True` values per column, effectively counting the number of
missing values.
Output: Lists the count of missing values for each column, which is crucial for deciding how to
handle them (e.g., imputation, removal).
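
If missing values did appear, they would typically be handled before preprocessing. A hedged illustration, not part of the original notebook, assuming that dropping or blanking the affected rows is acceptable for this dataset:

python

# Illustrative only: drop rows whose review text or label is missing
train_df = train_df.dropna(subset=['Description', 'Emotion'])
# Alternatively, keep the rows and replace a missing description with an empty string
test_df['Description'] = test_df['Description'].fillna("")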

Cell 8: Distribution of Browsers Used


python

# %%
import seaborn as sns

plt.figure(figsize=(8, 5))
sns.countplot(data=train_df, x='Browser_Used', order=train_df['Browser_Used'].value_counts().index)
plt.title('Distribution of Browsers Used')
plt.xlabel('Browser')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

Explanation:

Purpose: Visualizes the distribution of different browsers used, as recorded in the `Browser_Used`
column of the training dataset.

Libraries Used:
Seaborn (`sns`): For creating aesthetically pleasing statistical plots.
Matplotlib (`plt`): For plotting and customizing the visualization.

Plot Details:
`countplot`:Creates a bar plot showing the count of occurrences for each category in
`Browser_Used`.

`order`: Sorts the bars based on the frequency of each browser.


`figsize`: Sets the size of the plot for better visibility.
`xticks(rotation=45)`: Rotates the x-axis labels by 45 degrees to prevent overlap if labels are
long.
Usage: Helps in understanding which browsers are most commonly used in the dataset, which
could be relevant for analysis or feature engineering.

Cell 9: Distribution of Device Types


python

# %%
plt.figure(figsize=(8, 5))
sns.countplot(data=train_df, x='Device_Used', order=train_df['Device_Used'].value_counts().index)
plt.title('Distribution of Device Types')
plt.xlabel('Device Type')
plt.ylabel('Count')
plt.show()

Explanation:

Purpose: Similar to the previous cell, this visualizes the distribution of different device types used,
based on the `Device_Used` column.
Plot Details:
`countplot`: Shows the number of occurrences for each device type.

`order`: Sorts the device types based on their frequency.
`figsize`: Sets the plot size.
No rotation of x-axis labels: Assumes device type labels are short enough to fit without
rotation.
Usage: Provides insights into the variety and frequency of device types in the dataset, which may
influence user behavior or sentiment.

Cell 10: Distribution of Emotions


python

# %%
plt.figure(figsize=(8, 5))
sns.countplot(data=train_df, x='Emotion', order=train_df['Emotion'].value_counts().index)
plt.title('Distribution of Emotions')
plt.xlabel('Emotion')
plt.ylabel('Count')
plt.show()

Explanation:

Purpose: Visualizes the distribution of emotions (e.g., 'happy', 'not happy') in the training dataset.
Plot Details:

`countplot`: Displays the number of samples for each emotion category.


`order`: Sorts the emotions by their frequency.
`figsize`: Sets the size of the plot.

Usage: Essential for understanding class distribution, which is crucial for handling class imbalance
during model training.

Cell 11: Generating Word Clouds


python

# %%
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Generate word clouds for Happy and Not Happy reviews


def generate_wordclouds(df, emotion_col, text_col):
    # Filter data for Happy and Not Happy sentiments
    happy_text = " ".join(review for review in df[df[emotion_col] == "happy"][text_col].astype(str))
    not_happy_text = " ".join(review for review in df[df[emotion_col] == "not happy"][text_col].astype(str))

    # Generate word clouds
    happy_wordcloud = WordCloud(width=800, height=400, background_color='white').generate(happy_text)
    not_happy_wordcloud = WordCloud(width=800, height=400, background_color='white').generate(not_happy_text)

    # Plot the word clouds
    plt.figure(figsize=(16, 8))

    # Word cloud for Happy
    plt.subplot(1, 2, 1)
    plt.imshow(happy_wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title('Word Cloud for Happy Descriptions')

    # Word cloud for Not Happy
    plt.subplot(1, 2, 2)
    plt.imshow(not_happy_wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title('Word Cloud for Not Happy Descriptions')

    plt.show()

# Generate word clouds
generate_wordclouds(train_df, emotion_col='Emotion', text_col='Description')

Explanation:

Purpose: Creates and displays word clouds for reviews labeled as "happy" and "not happy". Word
clouds visually represent the most frequent words in each category, giving an intuitive sense of the
prominent terms associated with each emotion.

Libraries Used:
WordCloud: To generate the word cloud images.
Matplotlib (`plt`): For plotting the word clouds.

Function `generate_wordclouds`:
Parameters:
`df`: The DataFrame containing the data.
`emotion_col`: The column name for emotions (e.g., 'Emotion').
`text_col`: The column name for the text descriptions (e.g., 'Description').
Process:
Filtering Text:
`happy_text`: Concatenates all descriptions labeled as "happy".
`not_happy_text`: Concatenates all descriptions labeled as "not happy".
Generating Word Clouds:
`WordCloud.generate()`: Creates the word cloud image from the concatenated text.
Plotting:
Creates a side-by-side plot with word clouds for "happy" and "not happy"
descriptions.
`plt.subplot(1, 2, 1)` and `plt.subplot(1, 2, 2)`: Defines a 1-row, 2-column grid
for plotting both word clouds side by side.
`plt.imshow()`: Displays the word cloud image.
`plt.axis('off')`: Hides the axis for a cleaner look.
Titles: Provides context for each word cloud.
Usage: Helps in understanding the common words and themes associated with each emotion,
which can be valuable for feature engineering and model interpretation.

Cell 12: Text Cleaning and Preprocessing


python

# %%
# Pre-compile regex and use a set for stopwords
stop_words = set(stopwords.words('english'))
pattern = re.compile(r"[^a-zA-Z ]")

def clean_text_optimized(text):
    # Remove special characters and numbers
    text = pattern.sub("", text)
    # Convert to lowercase and split
    tokens = text.lower().split()
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    return " ".join(tokens)

# Apply cleaning with list comprehension for faster execution
train_df['cleaned_description'] = [clean_text_optimized(text) for text in train_df['Description']]
test_df['cleaned_description'] = [clean_text_optimized(text) for text in test_df['Description']]

Explanation:

Purpose: Cleans and preprocesses the text data in the `Description` column by removing
unwanted characters, converting text to lowercase, tokenizing, and removing stopwords. The
cleaned text is stored in a new column `cleaned_description`.
Components:
Stopwords:
`stopwords.words('english')`: Retrieves a list of common English stopwords (e.g., 'the',
'is', 'and').
`set(...)`: Converts the stopword list to a set for faster lookup during removal.
Regular Expression Pattern:
`re.compile(r"[^a-zA-Z ]")`: Compiles a regex pattern that matches any character that
is not an uppercase or lowercase letter or a space. This pattern is used to remove special
characters and numbers from the text.
Function `clean_text_optimized`:
`pattern.sub("", text)`: Removes all characters matching the regex pattern from the
text.
`text.lower().split()`: Converts the text to lowercase and splits it into tokens (words).
Stopword Removal: Filters out tokens that are present in the `stop_words` set.
Returns: A cleaned and space-separated string of tokens.
Applying the Cleaning Function:
List Comprehension:
`[clean_text_optimized(text) for text in train_df['Description']]`: Applies the
cleaning function to each description in the training data.
Similarly for `test_df`: Cleans the descriptions in the testing data.
`train_df['cleaned_description']` and `test_df['cleaned_description']`: New columns in the
DataFrames containing the cleaned text.
Optimization:
Pre-compiling Regex and Using Sets: Improves performance by compiling the regex pattern
once and using sets for faster stopword lookups, which is beneficial when processing large
datasets.

Usage: Preprocessing text data is crucial for improving model performance by reducing noise and
focusing on meaningful words.
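
For intuition, here is a small hedged example of calling `clean_text_optimized` on a made-up review; the exact output depends on NLTK's English stopword list:

python

sample = "The room was NOT clean!!! 2 nights, very noisy AC."
print(clean_text_optimized(sample))
# Roughly: "room clean nights noisy ac"  (punctuation, digits, and stopwords removed)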

Cell 13: Display Training Data After Cleaning


python

# %%
train_df

Explanation:

Purpose: Displays the `train_df` DataFrame after adding the `cleaned_description` column. This
allows you to verify that the text cleaning process has been applied correctly.

Usage: Ensures that the new column `cleaned_description` contains the expected cleaned text
without special characters, numbers, and stopwords.

Cell 14: Display Testing Data After Cleaning


python

# %%
test_df

Explanation:

Purpose: Similar to the previous cell, this displays the `test_df` DataFrame after adding the
`cleaned_description` column.

Usage: Confirms that the testing data has been preprocessed correctly and that the
`cleaned_description` column is present and correctly formatted.

Cell 15: Computing Class Weights


python

# %%
# Compute class weights
class_weights = compute_class_weight(
    class_weight='balanced',
    classes=np.unique(train_df['Emotion']),
    y=train_df['Emotion']
)
class_weight_dict = {i: class_weights[i] for i in range(len(class_weights))}
print("Class Weights:", class_weight_dict)

Explanation:

Purpose: Calculates class weights to handle class imbalance in the dataset. Class weights help the
model give appropriate importance to each class during training.
Components:
`compute_class_weight`:
`class_weight='balanced'`: Automatically adjusts weights inversely proportional to class
frequencies.
`classes=np.unique(train_df['Emotion'])`: Identifies unique classes in the `Emotion`
column.
`y=train_df['Emotion']`:
The target labels for which class weights are computed.
`class_weight_dict`: Creates a dictionary mapping each class index to its corresponding
weight.
`{i: class_weights[i] for i in range(len(class_weights))}`: Enumerates the class
weights for easy reference during model training.
Output:
Print Statement: Displays the computed class weights.
Usage: When training models, especially neural networks, providing class weights can help
improve performance on imbalanced datasets by preventing the model from being biased toward
the majority class.
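
As a concrete, hypothetical illustration of the 'balanced' heuristic: each class receives the weight n_samples / (n_classes * n_samples_in_class), so the minority class gets a proportionally larger weight. With made-up counts:

python

# Hypothetical class counts, for illustration only
n_happy, n_not_happy = 10_000, 4_000
n_samples, n_classes = n_happy + n_not_happy, 2

weight_happy = n_samples / (n_classes * n_happy)          # 14000 / 20000 = 0.70
weight_not_happy = n_samples / (n_classes * n_not_happy)  # 14000 / 8000  = 1.75
print(weight_happy, weight_not_happy)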

Cell 16: Mapping Emotion Labels to Numeric Values


python

# %%
# Convert Emotion labels to numeric
emotion_mapping = {'happy': 1, 'not happy': 0}
train_labels = train_df['Emotion'].map(emotion_mapping).values

Explanation:

Purpose: Transforms categorical emotion labels into numeric values, which are required for
machine learning algorithms that operate on numerical data.

Components:
`emotion_mapping`: A dictionary that maps each emotion label to a numeric value.
'happy' → 1
'not happy' → 0
`train_df['Emotion'].map(emotion_mapping)`: Applies the mapping to the `Emotion` column,
converting categorical labels to numeric.

`.values`: Extracts the mapped values as a NumPy array, resulting in `train_labels`.


Usage: Numerical labels are essential for training classifiers, especially those that require
numerical input like logistic regression, SVMs, or neural networks.

Cell 17: Splitting Data into Training and Validation Sets
python

# %%
# Split train data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    train_df['cleaned_description'], train_df['Emotion'], test_size=0.2, random_state=42
)

Explanation:

Purpose: Divides the cleaned training data into training and validation subsets. The validation set
is used to evaluate the model's performance during training, helping to monitor overfitting and
generalization.

Components:
`train_test_split`: A Scikit-learn function that splits arrays or matrices into random train and
validation subsets.

Parameters:
`train_df['cleaned_description']`: The feature set containing the cleaned text
descriptions.
`train_df['Emotion']`: The target labels.
`test_size=0.2`: Allocates 20% of the data to the validation set and 80% to the training
set.
`random_state=42`: Ensures reproducibility by setting a seed for random number
generation.
Outputs:
`X_train`: Training features.
`X_val`: Validation features.
`y_train`: Training labels.
`y_val`: Validation labels.
Usage: Splitting the data helps in assessing the model's performance on unseen data, which is
crucial for evaluating its ability to generalize.

Cell 18: TF-IDF Vectorization and Random Forest Model Training


python

# %%
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, accuracy_score

# TF-IDF Vectorization
print("Performing TF-IDF Vectorization...")
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train).toarray()
X_val_tfidf = tfidf_vectorizer.transform(X_val).toarray()

# Random Forest Model
print("Training Random Forest...")
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_tfidf, y_train)
rf_predictions = rf_model.predict(X_val_tfidf)

print("Random Forest Classification Report:")


print(classification_report(y_val, rf_predictions, target_names=['not happy', 'happy']))
print("Random Forest Accuracy:", accuracy_score(y_val, rf_predictions))

Explanation:

Purpose: This cell performs feature extraction using TF-IDF vectorization and trains a Random
Forest classifier on the training data. It then evaluates the model's performance on the validation
set.
Components:
Imports:
`RandomForestClassifier`: An ensemble learning method for classification.
`TfidfVectorizer`: Converts text data into TF-IDF feature vectors.
`classification_report`, `accuracy_score`: Metrics for evaluating model performance.
TF-IDF Vectorization:
`TfidfVectorizer(max_features=5000)`: Initializes the vectorizer to consider the top 5000
most frequent words.
`fit_transform(X_train)`: Learns the vocabulary from the training data and transforms
the text into TF-IDF features.
`transform(X_val)`: Transforms the validation data using the learned vocabulary.
`.toarray()`: Converts the sparse matrix to a dense NumPy array for compatibility with
the classifier.
Random Forest Training:
`RandomForestClassifier(n_estimators=100, random_state=42)`: Initializes the classifier
with 100 decision trees and a fixed random state for reproducibility.
`fit(X_train_tfidf, y_train)`: Trains the model on the TF-IDF features and
corresponding labels.
Predictions and Evaluation:
`predict(X_val_tfidf)`: Generates predictions for the validation set.
`classification_report`: Provides precision, recall, f1-score, and support for each class.
`accuracy_score`: Calculates the overall accuracy of the model.
Output:
TF-IDF Vectorization Confirmation: Prints a message indicating that TF-IDF vectorization is
being performed.

Random Forest Training Confirmation: Prints a message indicating that the Random Forest
model is being trained.
Classification Report: Detailed performance metrics for each class.
Accuracy: Overall accuracy score on the validation set.

Usage: TF-IDF converts text data into numerical features suitable for machine learning models.
Random Forest is a robust classifier that can handle high-dimensional data and provides good
baseline performance.
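
To make the TF-IDF step more tangible, here is a tiny self-contained sketch on toy documents, separate from the notebook's data (`get_feature_names_out` assumes a reasonably recent scikit-learn):

python

from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["room clean spacious", "service bad room bad"]
vec = TfidfVectorizer()
X = vec.fit_transform(toy_docs)

print(vec.get_feature_names_out())  # e.g. ['bad' 'clean' 'room' 'service' 'spacious']
print(X.toarray().round(2))         # one TF-IDF weight vector per document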

Cell 19: Confusion Matrix for Random Forest
python

# %%
# Confusion Matrix
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Generate confusion matrix


cm = confusion_matrix(y_val, rf_predictions)

# Plot confusion matrix


plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['not happy', 'happy'], yticklabels=['not happy', 'happy'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix for Random Forest')
plt.show()

Explanation:

Purpose: Visualizes the performance of the Random Forest classifier using a confusion matrix,
which shows the number of correct and incorrect predictions for each class.
Components:
Imports:
`confusion_matrix`: Computes the confusion matrix.
Seaborn (`sns`) and Matplotlib (`plt`): For plotting the heatmap.
Generating the Confusion Matrix:
`confusion_matrix(y_val, rf_predictions)`: Compares the true labels (`y_val`) with the
predicted labels (`rf_predictions`) to create the matrix.
Plotting:
`sns.heatmap(cm, ...)`: Creates a heatmap from the confusion matrix.
`annot=True`: Annotates each cell with its numeric value.
`fmt='d'`: Formats the annotations as integers.
`cmap='Blues'`: Sets the color palette.
`xticklabels` and `yticklabels`: Labels the axes with class names.
Labels and Title: Adds labels for axes and a title for context.
Interpretation:
True Positives (TP): Correctly predicted "happy" reviews.

True Negatives (TN): Correctly predicted "not happy" reviews.


False Positives (FP): "Not happy" reviews incorrectly predicted as "happy".
False Negatives (FN): "Happy" reviews incorrectly predicted as "not happy".

Usage: Helps in understanding the types of errors the model is making, which is useful for further
model improvement.
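
Since a 2x2 confusion matrix maps directly to TN, FP, FN, TP, per-class precision and recall can be recovered from it. A hedged sketch reusing `y_val` and `rf_predictions` from this cell; passing `labels` explicitly pins the row/column order so `ravel()` reliably yields (tn, fp, fn, tp):

python

from sklearn.metrics import confusion_matrix

# Order the matrix as [negative, positive] = ['not happy', 'happy']
tn, fp, fn, tp = confusion_matrix(
    y_val, rf_predictions, labels=['not happy', 'happy']
).ravel()

precision_happy = tp / (tp + fp)
recall_happy = tp / (tp + fn)
print(f"Precision (happy): {precision_happy:.3f}  Recall (happy): {recall_happy:.3f}")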

Cell 20: Preparing Data for LSTM Model


python

# %%
# Tokenize and Pad Sequences for LSTM
print("\\nPreparing data for LSTM...")
max_words = 5000
max_len = 100
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train)

X_train_sequences = tokenizer.texts_to_sequences(X_train)
X_val_sequences = tokenizer.texts_to_sequences(X_val)

X_train_padded = pad_sequences(X_train_sequences, maxlen=max_len, padding='post')


X_val_padded = pad_sequences(X_val_sequences, maxlen=max_len, padding='post')

Explanation:

Purpose: Prepares the text data for training a Long Short-Term Memory (LSTM) neural network by
tokenizing the text and padding the sequences to a uniform length.
Components:
Parameters:
`max_words = 5000`: The maximum number of words to keep, based on word frequency.
Only the top 5000 most frequent words will be considered.
`max_len = 100`: The maximum length of all sequences. Sequences shorter than this will
be padded, and longer ones will be truncated.
Tokenizer Initialization:
`Tokenizer(num_words=max_words)`: Initializes the tokenizer to consider only the top
`max_words` most frequent words.
Fitting the Tokenizer:
`tokenizer.fit_on_texts(X_train)`: Updates internal vocabulary based on the training
texts.
Text to Sequences:
`texts_to_sequences(X_train)`: Converts each text in `X_train` to a sequence of integers
(each integer represents a word's index in the tokenizer's vocabulary).
`texts_to_sequences(X_val)`: Similarly converts validation texts.
Padding Sequences:
`pad_sequences(..., maxlen=max_len, padding='post')`: Pads sequences with zeros at the
end (`padding='post'`) to ensure all sequences are of length `max_len`. Shorter
sequences are padded, and longer sequences are truncated to `max_len`.
Outputs:
`X_train_padded`: Padded sequences for the training set.
`X_val_padded`: Padded sequences for the validation set.

Usage: LSTM models require input sequences of uniform length. Tokenization converts text into
numerical data, and padding ensures consistency in sequence lengths, enabling efficient batch
processing.
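
A tiny self-contained sketch of what tokenization and padding produce (word indices will differ from the notebook's fitted tokenizer):

python

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

toy_texts = ["room clean spacious", "service bad"]
toy_tok = Tokenizer(num_words=50)
toy_tok.fit_on_texts(toy_texts)

seqs = toy_tok.texts_to_sequences(toy_texts)   # e.g. [[1, 2, 3], [4, 5]]
padded = pad_sequences(seqs, maxlen=5, padding='post')
print(padded)  # each row zero-padded at the end to length 5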

Cell 21: Building and Summarizing the LSTM Model

python

# %%
def build_lstm_model(vocab_size, embedding_dim=128, input_length=100):
    model = Sequential()
    model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=input_length))
    model.add(LSTM(128, return_sequences=True))
    model.add(Dropout(0.2))
    model.add(LSTM(64))
    model.add(Dropout(0.2))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))  # Binary classification

    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

    return model

vocab_size = min(len(tokenizer.word_index) + 1, max_words)


model = build_lstm_model(vocab_size, embedding_dim=128, input_length=max_len)
model.summary()

Explanation:

Purpose: Defines and compiles an LSTM-based neural network model for binary text classification
(i.e., predicting "happy" vs. "not happy" sentiments).
Function `build_lstm_model`:
Parameters:
`vocab_size`: Size of the vocabulary (number of unique tokens). Calculated as the
minimum between the actual vocabulary size and `max_words`.
`embedding_dim=128`: Dimension of the embedding vectors.
`input_length=100`: Length of input sequences.
Model Architecture:
1. Embedding Layer:
`Embedding(input_dim=vocab_size, output_dim=embedding_dim,
input_length=input_length)`: Converts integer sequences into dense vectors of
fixed size (`embedding_dim`).
2. First LSTM Layer:
`LSTM(128, return_sequences=True)`: An LSTM layer with 128 units that returns the
full sequence of outputs, allowing stacking of another LSTM layer.
3. Dropout Layer:
`Dropout(0.2)`: Randomly sets 20% of inputs to zero during training to prevent
overfitting.
4. Second LSTM Layer:
`LSTM(64)`: Another LSTM layer with 64 units that returns the last output in the
output sequence.
5. Second Dropout Layer:
`Dropout(0.2)`: Another dropout layer with 20% dropout rate.
6. Dense Layers:
`Dense(32, activation='relu')`: A fully connected layer with 32 units and ReLU
activation.
`Dense(1, activation='sigmoid')`: The output layer with a single neuron and
sigmoid activation for binary classification.
Compilation:
`optimizer='adam'`: Uses the Adam optimization algorithm.

`loss='binary_crossentropy'`: Binary cross-entropy loss function, suitable for binary
classification.
`metrics=['accuracy']`:
Tracks accuracy during training.
Returns: The compiled Sequential model.
Building the Model:
`vocab_size = min(len(tokenizer.word_index) + 1, max_words)`: Determines the vocabulary
size, ensuring it does not exceed `max_words`.
`model = build_lstm_model(...)`: Creates the LSTM model with specified parameters.

`model.summary()`:
Prints a summary of the model architecture, including layers, output
shapes, and number of parameters.
Usage: The LSTM model is designed to capture the sequential nature of text data, making it well-
suited for tasks like sentiment analysis where context and word order matter.
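
As a quick, hedged sanity check on the numbers `model.summary()` reports (assuming the vocabulary reaches the 5000-word cap used above):

python

# Embedding parameters = vocab_size * embedding_dim
# Keras LSTM parameters = 4 * units * (input_dim + units + 1)
vocab_size, embedding_dim, lstm_units = 5000, 128, 128

embedding_params = vocab_size * embedding_dim                     # 5000 * 128 = 640,000
lstm1_params = 4 * lstm_units * (embedding_dim + lstm_units + 1)  # 4 * 128 * 257 = 131,584
print(embedding_params, lstm1_params)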

Cell 22: Preparing Labels and Training the LSTM Model


python

# %%
# Map string labels to numeric values
emotion_mapping = {'happy': 1, 'not happy': 0}
y_train = pd.Series(y_train).map(emotion_mapping) # Ensure it's a Pandas Series
y_val = pd.Series(y_val).map(emotion_mapping)

# Ensure labels are numeric


y_train = np.array(y_train, dtype='float32')
y_val = np.array(y_val, dtype='float32')

# Split data into training and validation sets


X_train, X_val, y_train, y_val = train_test_split(
    X_train_padded, y_train, test_size=0.2, random_state=42
)

# Early stopping to prevent overfitting


early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# Train the model


history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    batch_size=32,
    epochs=20,
    callbacks=[early_stopping]
)

Explanation:

Purpose: Prepares the labels for training, splits the data appropriately, sets up early stopping, and
trains the LSTM model.
Components:
Mapping Labels:
`emotion_mapping`: Same as before, mapping 'happy' to 1 and 'not happy' to 0.
`pd.Series(y_train).map(emotion_mapping)`: Ensures that `y_train` and `y_val` are
Pandas Series before mapping.
Ensuring Numeric Labels:
`np.array(..., dtype='float32')`: Converts the mapped labels to NumPy arrays of type
float32, which is compatible with Keras models.
Data Splitting:
`train_test_split(...)`: Further splits the padded training data and labels into training
and validation sets.
`X_train_padded`: Padded sequences from earlier.
`y_train`: Numeric labels.
`test_size=0.2`: 20% of the data is used for validation.
`random_state=42`: Ensures reproducibility.
Early Stopping:
`EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)`:
`monitor='val_loss'`: Monitors the validation loss.
`patience=3`: Training stops if the validation loss doesn't improve for 3 consecutive
epochs.
`restore_best_weights=True`: Restores the model weights from the epoch with the
best validation loss.
Model Training:
`model.fit(...)`: Trains the LSTM model.
`X_train, y_train`: Training data and labels.
`validation_data=(X_val, y_val)`: Validation data and labels.
`batch_size=32`: Number of samples per gradient update.
`epochs=20`: Maximum number of epochs to train.
`callbacks=[early_stopping]`: Applies early stopping during training.
Outputs:
`history`: Contains training and validation loss and accuracy for each epoch, useful for
plotting learning curves.
Usage: This cell finalizes the data preparation and initiates the training process, leveraging early
stopping to prevent overfitting and ensure the model generalizes well to unseen data.
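
Note that the class weights computed in Cell 15 are never actually passed to `model.fit` here. A hedged sketch of doing so, re-keying the weights to the numeric labels used by the LSTM and assuming the variables defined in the cells above:

python

# Re-key the class weights from string labels to the numeric labels used for training
classes = np.unique(train_df['Emotion'])              # ['happy', 'not happy']
weights_by_label = dict(zip(classes, class_weights))  # string label -> weight
class_weight_numeric = {emotion_mapping[c]: w for c, w in weights_by_label.items()}

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    batch_size=32,
    epochs=20,
    class_weight=class_weight_numeric,
    callbacks=[early_stopping]
)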

Cell 23: Classification Report for LSTM Model


python

# %%
from sklearn.metrics import classification_report
# Predict on the validation set
y_val_pred_probs = model.predict(X_val)
y_val_preds = (y_val_pred_probs >= 0.5).astype(int)  # Convert probabilities to binary predictions

# Generate classification report


print("Classification Report:")
print(classification_report(y_val, y_val_preds, target_names=['not happy', 'happy']))

Explanation:

Purpose: Evaluates the trained LSTM model's performance on the validation set by generating a
classification report that includes precision, recall, f1-score, and support for each class.
Components:

Prediction:
`model.predict(X_val)`: Generates predicted probabilities for each sample in the
validation set.
`(y_val_pred_probs >= 0.5).astype(int)`: Converts probabilities to binary class
predictions using a threshold of 0.5. Probabilities ≥ 0.5 are classified as class `1`
("happy"), and those < 0.5 as class `0` ("not happy").
Classification Report:
`classification_report(y_val, y_val_preds, target_names=['not happy', 'happy'])`:
Generates a text report showing the main classification metrics for each class.
Output:
Print Statement: Displays the classification report in the console.
Usage: Provides a detailed understanding of the model's performance, highlighting strengths and
areas for improvement in terms of correctly predicting each class.

Cell 24: Confusion Matrix for LSTM Model


python

# %%
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Generate confusion matrix


cm = confusion_matrix(y_val, y_val_preds)

# Plot confusion matrix


plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['not happy', 'happy'], yticklabels=['not happy', 'happy'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix for LSTM Model')
plt.show()

Explanation:

Purpose: Similar to Cell 19, this cell creates a confusion matrix for the LSTM model's predictions,
providing a visual representation of the model's performance.
Components:
Generating the Confusion Matrix:
`confusion_matrix(y_val, y_val_preds)`: Compares true labels with predicted labels to
form the matrix.
Plotting:
`sns.heatmap(...)`: Creates a heatmap visualization of the confusion matrix.
`annot=True`: Annotates each cell with its count.
`fmt='d'`: Formats annotations as integers.
`cmap='Blues'`: Sets the color scheme.
`xticklabels` and `yticklabels`: Labels the axes with class names.
Labels and Title: Adds appropriate labels and a title to the plot.
Interpretation:

True Positives (TP): Correctly predicted "happy" reviews.
True Negatives (TN): Correctly predicted "not happy" reviews.
False Positives (FP): "Not happy" reviews incorrectly predicted as "happy".
False Negatives (FN): "Happy" reviews incorrectly predicted as "not happy".
Usage: Enables a quick visual assessment of the types of errors the LSTM model is making, aiding
in diagnosing model performance issues.

Cell 25: Preparing for Aspect-Based Sentiment Analysis


python

# %%
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

Explanation:

Purpose: Sets up additional NLP tools required for aspect-based sentiment analysis, which involves
identifying sentiments related to specific aspects or components within a review.

Components:
Imports:
`WordNetLemmatizer`: For lemmatizing words (reducing words to their base or dictionary
form).
`word_tokenize`: For tokenizing sentences into words.
Initialization:
`lemmatizer = WordNetLemmatizer()`: Creates an instance of the lemmatizer.
`stop_words = set(stopwords.words('english'))`: Reinitializes the set of English
stopwords for use in subsequent text processing.
Usage: These tools are essential for processing text data in a more granular manner, especially
when focusing on specific aspects within the text.

Cell 26: Re-defining the Text Cleaning Function


python

# %%
pattern = re.compile(r"[^a-zA-Z ]")
stop_words = set(stopwords.words('english'))

def clean_text(text):
    text = pattern.sub("", text)  # Remove special characters and numbers
    tokens = text.lower().split()  # Convert to lowercase and tokenize
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    return " ".join(tokens)

Explanation:

Purpose: Redefines the `clean_text` function, similar to `clean_text_optimized` from Cell 12, for
use in aspect-based sentiment analysis.

Components:
Pattern Compilation:
`re.compile(r"[^a-zA-Z ]")`: Compiles a regex pattern to remove non-alphabetic
characters and numbers.
Function `clean_text`:
`pattern.sub("", text)`: Removes unwanted characters from the text.
`text.lower().split()`: Converts text to lowercase and splits it into tokens.
Stopword Removal: Filters out tokens that are present in the `stop_words` set.
Returns: A cleaned, space-separated string of tokens.
Differences from `clean_text_optimized`:
Functionally identical cleaning: The steps are the same; the earlier cell applied
`clean_text_optimized` across entire DataFrame columns, whereas `clean_text` is called on
individual aspect strings during inference.
Potential for Lemmatization: Although not implemented here, the initialized `lemmatizer`
suggests a planned enhancement (a hedged sketch follows after this explanation).

Usage: Prepares text data for aspect-based sentiment analysis by ensuring that the text is clean
and free from noise, focusing on relevant words.
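
Since the lemmatizer is initialized but not yet used, here is a hedged sketch of how lemmatization could be folded into the cleaning step. The function name `clean_text_with_lemmas` is hypothetical, and `pattern`, `stop_words`, and `lemmatizer` are assumed from the cells above:

python

def clean_text_with_lemmas(text):
    # Same cleaning as clean_text, plus WordNet lemmatization of each kept token
    text = pattern.sub("", text)
    tokens = text.lower().split()
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return " ".join(tokens)

print(clean_text_with_lemmas("The rooms were spacious and the beds comfortable"))
# Roughly: "room spacious bed comfortable"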

Cell 27: Detecting Aspects in Reviews


python

# %%
def detect_aspects(review):
    aspects = {}
    review_lower = review.lower()
    if any(keyword in review_lower for keyword in ["room", "bed", "suite"]):
        aspects['room'] = review
    if any(keyword in review_lower for keyword in ["service", "staff", "reception"]):
        aspects['service'] = review
    return aspects

Explanation:

Purpose: Identifies specific aspects within a review that are relevant for aspect-based sentiment
analysis. In this case, the aspects are "room" and "service".
Function `detect_aspects`:
Parameters:
`review`: A string containing the text of a single review.

Process:
1. Initialization:
`aspects = {}`: Initializes an empty dictionary to store detected aspects and their
corresponding text.
2. Lowercasing:
`review_lower = review.lower()`: Converts the review text to lowercase to ensure
case-insensitive matching.
3. Aspect Detection:
Room-Related Keywords: Checks if any of the keywords ["room", "bed", "suite"]
are present in the review. If so, adds an entry `'room': review` to the `aspects`
dictionary.
Service-Related Keywords: Checks if any of the keywords ["service", "staff",
"reception"] are present. If so, adds `'service': review` to `aspects`.
4. Returns:
`aspects`: A dictionary mapping detected aspects to the original review text.
Usage: Extracts parts of the review that pertain to specific aspects, enabling targeted sentiment
analysis on those aspects rather than the entire review.
Example:
Review: "The room was clean and spacious, and the service was good."
Detected Aspects:
`'room': "The room was clean and spacious, and the service was good."`
`'service': "The room was clean and spacious, and the service was good."`

Cell 28: Predicting Sentiment for Each Aspect


python

# %%
# Function to predict sentiment for each aspect
def predict_aspect_sentiment(review):
    aspects = detect_aspects(review)
    results = {}
    for aspect, text in aspects.items():
        cleaned_text = clean_text(text)
        sequence = tokenizer.texts_to_sequences([cleaned_text])
        padded_sequence = pad_sequences(sequence, maxlen=max_len, padding='post')
        prediction = model.predict(padded_sequence)
        results[aspect] = 'happy' if prediction >= 0.5 else 'not happy'
    return results

Explanation:

Purpose: Predicts the sentiment ("happy" or "not happy") for each detected aspect within a review.
Function `predict_aspect_sentiment`:
Parameters:
`review`: A string containing the text of a single review.
Process:
1. Aspect Detection:

`aspects = detect_aspects(review)`: Identifies aspects within the review using the
previously defined `detect_aspects` function.
2. Initialization:
`results = {}`: Initializes an empty dictionary to store sentiment predictions for
each aspect.
3. Sentiment Prediction Loop:
For each detected aspect and its corresponding text:
Text Cleaning:
`cleaned_text = clean_text(text)`: Cleans the aspect-specific text.
Tokenization and Padding:
`sequence = tokenizer.texts_to_sequences([cleaned_text])`: Converts
the cleaned text into a sequence of integers using the tokenizer.
`padded_sequence = pad_sequences(sequence, maxlen=max_len,
padding='post')`: Pads the sequence to ensure uniform length.
Prediction:
`model.predict(padded_sequence)`: Generates a probability prediction
using the trained LSTM model.
Assigning Sentiment:
`'happy' if prediction >= 0.5 else 'not happy'`: Converts the
probability into a binary sentiment label based on a threshold of 0.5.
Storing Results:
`results[aspect] = sentiment`: Stores the predicted sentiment for the
aspect in the `results` dictionary.
4. Returns:
`results`:A dictionary mapping each detected aspect to its predicted sentiment.
Usage: Enables granular sentiment analysis by associating sentiments with specific aspects
mentioned in a review, providing more detailed insights than overall sentiment analysis.
Example:
Review: "The room was clean and spacious, and the service was good."
Detected Aspects: `'room'` and `'service'`.
Predictions:
`'room': 'happy'`
`'service': 'happy'`

Cell 29: Predicting Sentiments for Multiple Reviews


python

# %%
# Function to predict sentiment for multiple reviews
def predict_aspect_sentiments_multiple(reviews):
    results = []  # Store the results for each review
    for review in reviews:
        # Detect aspect-based sentiments
        predicted_sentiments = predict_aspect_sentiment(review)

        # Determine overall sentiment
        overall_sentiment = (
            'happy' if list(predicted_sentiments.values()).count('happy') >
            list(predicted_sentiments.values()).count('not happy') else 'not happy'
        )

        # Append the results for the current review
        results.append({
            'Review': review,
            'Aspect-Based Sentiments': predicted_sentiments,
            'Overall Sentiment': overall_sentiment
        })
    return results

# Example reviews
reviews = [
    "The room was clean and spacious, and the service was good",
    "room was very bad.",
    "The service was very bad and room was also bad."
]

# Predict sentiments for multiple reviews
review_analysis = predict_aspect_sentiments_multiple(reviews)

# Print the results for each review
for analysis in review_analysis:
    print(f"Review: {analysis['Review']}")
    print(f"Aspect-Based Sentiments: {analysis['Aspect-Based Sentiments']}")
    print(f"Overall Sentiment: {analysis['Overall Sentiment']}")
    print("---")

Explanation:

Purpose: Applies the aspect-based sentiment prediction function to multiple reviews, aggregates
the aspect sentiments to determine an overall sentiment for each review, and prints the results.
Function `predict_aspect_sentiments_multiple`:
Parameters:
`reviews`: A list of review strings.
Process:
1. Initialization:
`results = []`: Initializes an empty list to store the analysis results for each review.
2. Loop Through Reviews:
For each `review` in `reviews`:
Aspect-Based Sentiment Prediction:
`predicted_sentiments = predict_aspect_sentiment(review)`: Gets a
dictionary of sentiments for each detected aspect.
Overall Sentiment Determination:
Counts the number of 'happy' and 'not happy' sentiments across
aspects.
Assigns the overall sentiment as 'happy' if there are more 'happy'
sentiments, otherwise 'not happy'.
Appending Results:
Adds a dictionary containing the original review, aspect-based
sentiments, and overall sentiment to the `results` list.
3. Returns:
`results`: A list of dictionaries, each containing analysis for a review.
Example Reviews:
1. "The room was clean and spacious, and the service was good"
Detected Aspects: 'room' and 'service'.
Predicted Sentiments: 'happy' for both.
Overall Sentiment: 'happy'.

2. "room was very bad."
Detected Aspect: 'room'.
Predicted Sentiment: 'not happy'.
Overall Sentiment: 'not happy'.
3. "The service was very bad and room was also bad."
Detected Aspects: 'service' and 'room'.
Predicted Sentiments: 'not happy' for both.
Overall Sentiment: 'not happy'.
Output:
Print Statements: For each review, prints the original text, the sentiments associated with
each detected aspect, and the overall sentiment.
Usage: Demonstrates how the aspect-based sentiment analysis can be applied to multiple reviews,
providing both detailed and aggregated sentiment insights.

Cell 30: Empty Cell


python

# %%

Explanation:

Purpose: This is an empty cell, possibly left for future code additions or as a placeholder.
Usage: Can be used to add additional code or perform further analysis as needed.

Overall Workflow Summary:


1. Data Import and Exploration:
Import necessary libraries.
Load training and testing datasets.
Display and inspect datasets for structure and missing values.
Visualize distributions of key categorical variables like browser used, device type, and
emotions.
2. Data Preprocessing:
Generate word clouds to visualize common words in different emotion categories.
Clean text data by removing special characters, numbers, and stopwords.
Compute class weights to handle class imbalance.
Map categorical labels to numeric values.
Split data into training and validation sets.
3. Model Training and Evaluation:
Random Forest:
Perform TF-IDF vectorization.
Train a Random Forest classifier.
Evaluate using classification report and confusion matrix.

LSTM Neural Network:
Tokenize and pad sequences.
Build and compile an LSTM-based model.
Train the model with early stopping.
Evaluate using classification report and confusion matrix.
4. Aspect-Based Sentiment Analysis:
Define functions to detect specific aspects within reviews.
Predict sentiments for each detected aspect.
Aggregate aspect sentiments to determine overall sentiment.
Apply the analysis to example reviews and display results.

Additional Recommendations:
Handling Imbalanced Classes:
While class weights are computed, ensure that they are applied during model training,
especially for the LSTM model if not already done.
Model Improvement:
Experiment with different model architectures, hyperparameters, and feature engineering
techniques to enhance performance.
Consider using pre-trained embeddings like GloVe or Word2Vec for the Embedding layer to
leverage semantic information.
Aspect Detection Enhancement:
Use more sophisticated methods for aspect extraction, such as dependency parsing or
named entity recognition, to improve the accuracy of aspect detection.
Incorporate lemmatization and stemming to normalize words.
Evaluation Metrics:
In addition to accuracy, consider other metrics like ROC-AUC, especially if dealing with
imbalanced classes.
Plot learning curves to monitor for overfitting or underfitting.
Deployment Considerations:
Save the trained models and tokenizer for future use in deploying the model to production.
Implement a pipeline for processing new reviews and predicting sentiments in real-time.
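
A hedged sketch of two of the suggestions above: persisting the trained LSTM model and tokenizer (the file names are placeholders), and computing ROC-AUC from the validation probabilities produced in Cell 23:

python

import pickle
from sklearn.metrics import roc_auc_score

# Save the trained model and the fitted tokenizer (placeholder file names)
model.save("lstm_sentiment_model.h5")  # exact save format depends on the Keras version in use
with open("tokenizer.pkl", "wb") as f:
    pickle.dump(tokenizer, f)

# ROC-AUC uses predicted probabilities, not thresholded labels
print("ROC-AUC:", roc_auc_score(y_val, y_val_pred_probs.ravel()))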

Feel free to ask if you need further clarification on any part of the code or if you have additional
questions!

