Complete Report
MULTIMODAL PERSPECTIVE
A PROJECT REPORT
Submitted by
BACHELOR OF ENGINEERING
in
COMPUTER SCIENCE AND ENGINEERING
ANNA UNIVERSITY, CHENNAI
BONAFIDE CERTIFICATE
ACKNOWLEDGEMENT
At the outset, we would like to express our sincere gratitude to our beloved Dr.
B. Babu Manoharan M.A., M.B.A., Ph.D., Chairman, St. Joseph’s Group of
Institutions for his constant guidance and support to the student community and the
Society.
ABSTRACT
With the exponential growth of internet usage, sentiment analysis has emerged as
a pivotal domain within natural language processing (NLP). Leveraging sentiment
analysis, one can effectively mine the implicit emotions embedded in textual content
across various contexts. Given the extensive utilization of social media platforms,
where users exchange vast amounts of information, mining such data to gauge
sentiments becomes instrumental in understanding public opinion. Thus, the focal point
of this study is to delve into the sentiment analysis of user-generated content on Twitter.
To conduct this research, a dataset comprising 13,000 tweets was curated from Kaggle.
Subsequently, employing the natural language toolkit in Python, the collected data
underwent preprocessing. Annotation was facilitated through the utilization of
TextBlob. Through rigorous experimentation, employing a machine learning algorithm,
the study attained a commendable accuracy rate of 95.4%. The Bidirectional Long
Short-Term Memory (BiLSTM) neural network, coupled with unigrams, emerged as the
most effective approach in sentiment analysis for this study.
The findings of this research point towards a notable trend: the majority of users tend
to express sentiments that align with the prevalent topics being discussed on Twitter.
This insight underscores the influence of contextual factors on user sentiment
expression within social media discourse. By comprehensively analyzing user
sentiments, particularly within the Twitter ecosystem, researchers gain valuable insights
into public sentiment dynamics, which can inform various domains such as marketing,
politics, and public opinion monitoring.
TABLE OF CONTENTS
CHAPTER TITLE PAGE NO.
NO.
ABSTRACT iv
LIST OF FIGURES vii
LIST OF TABLES ix
LIST OF ABBREVIATIONS x
LIST OF SYMBOLS xi
1 INTRODUCTION 1
1.1 SENTIMENT ANALYSIS 1
1.2 NLP 1
1.2.1 TEXTBLOB 1
1.3 STEPS OF NLP 1
1.4 TYPES OF NLP 2
1.5 HOW DOES NLP WORK? 3
2 LITERATURE SURVEY 4
2.1 EXISTING SYSTEM 4
2.2 RELATED WORKS 5
2.3 PROPOSED SYSTEM 6
3 SYSTEM STUDY 8
3.1 SCOPE 8
3.2 PRODUCT FUNCTION 8
3.3 SYSTEM REQUIREMENTS 9
3.3.1 HARDWARE INTERFACES 9
3.3.2 SOFTWARE INTERFACES 9
3.3.2.1 ANACONDA 9
3.3.2.2 SPYDER 9
3.3.2.3 PYTHON 10
3.3.2.4 GOOGLE COLAB 10
4 SYSTEM DESIGN 11
4.1 OVERVIEW 11
4.2 OVERALL ARCHITECTURE 11
4.3 MODULES 11
4.3.1 BIDIRECTIONAL LONG SHORT-TERM MEMORY 12
4.3.2 NEURAL NETWORK 13
5 SYSTEM IMPLEMENTATION 15
5.1 OVERVIEW 15
5.2 ESSENTIAL LIBRARIES 15
5.2.1 PANDAS 15
5.2.2 TEXTBLOB 15
5.2.3 NLTK 15
5.2.4 MATPLOTLIB 15
5.2.5 SNOWBALLSTEMMER 16
5.3 FUNCTIONS USED FOR IMPLEMENTATION 16
5.3.1 DATA EXTRACT 16
5.3.2 CLEAN TWEET 16
5.3.3 SUBJECTIVITY 16
5.3.4 POLARITY 17
5.3.5 TOKENIZATION 17
5.3.6 STEMMING 17
6 RESULTS AND EVALUATION 18
6.1 PERFORMANCE METRICS 18
6.1.1 OVERVIEW 18
6.1.2 CONFUSION MATRIX 18
6.1.3 F1-SCORE 19
6.1.4 PRECISION 19
6.1.5 RECALL 19
6.2 RESULTS AND DISCUSSION 20
6.2.1 OVERVIEW 20
6.2.2 DATASET 20
6.2.2.1 STATIONARY DATASET 20
6.2.3 SCREENSHOTS 21
7 CONCLUSION AND FUTURE ENHANCEMENT 30
7.1 CONCLUSION 30
7.2 FUTURE ENHANCEMENT 30
8 APPENDICES 31
9 REFERENCES 45
LIST OF FIGURES
6.9 Distribution of text length for positive sentiment tweets 24
6.10 Distribution of text length for negative sentiment tweets 24
6.11 Pie Chart of Different Sentiments of Tweets 25
LIST OF TABLES
LIST OF ABBREVIATIONS
FP False Positive
TP True Positive
FN False Negative
NN Neural Network
LIST OF SYMBOLS
NOTATION MEANING
X Dataset
𝑤𝑖 Weights
P Precision
F F1-Score
L Labeling
R Recall
CHAPTER 1
INTRODUCTION
1.2 NLP
1.2.1 TEXTBLOB
• Lemmatization
• Stemming
• POS Tagging
• NER
Rule-based system:
This system uses carefully designed linguistic rules. This approach was used
early on in the development of natural language processing, and is still used.
Using a combination of machine learning, deep learning and neural networks, natural
language processing algorithms hone their own rules through repeated processing and
learning.
The journey of NLP begins with input acquisition, where raw text or speech data is
collected from various sources. This data undergoes preprocessing, a crucial step that
involves cleaning, formatting, and organizing the input to prepare it for further analysis.
Techniques like tokenization, where text is segmented into meaningful units like words
or sentences, and normalization, which standardizes text to a consistent format, help
streamline the processing pipeline. Following preprocessing, the language
understanding phase takes center stage, where NLP models delve into syntactic,
semantic, and discourse analysis to grasp the underlying meaning of the text. Syntax
analysis parses the grammatical structure, semantic analysis deciphers word meanings
and relationships, while discourse analysis considers the broader context to ensure
coherence and relevance.
In recent years, the use of social networking sites has significantly increased, generating
vast amounts of data as users express their views and opinions. This paper discusses
sentiment extraction from Twitter, where users post their opinions and views on various
topics. Sentiment analysis is conducted on tweets to provide insights for business
intelligence purposes. The system employs the Hadoop framework to process a movie
dataset available on Twitter, including reviews, feedback, and comments. Results of
sentiment analysis are presented categorically, highlighting positive, negative, and
neutral sentiments. The paper emphasizes the importance of sentiment analysis in
understanding public opinions, especially on social media platforms where individuals
frequently seek reviews and opinions on various subjects. Natural Language Processing
(NLP) techniques are utilized to analyze tweets and determine people's thoughts on
specific topics.
The analysis of sentiments helps in discerning whether the sentiment expressed in text
is positive, negative, or neutral, providing valuable insights into societal sentiments.
2.2 RELATED WORKS
2.2.1 Summary of Related Works:
Sentiment Analysis on Twitter Data:
This study explores sentiment analysis techniques applied to Twitter data,
recognizing the significance of understanding public opinions on social media platforms.
It utilizes Natural Language Processing (NLP) techniques such as tokenization,
elimination of stop words, and stemming to analyze sentiments expressed in tweets. The
system employs lexicons and multiplication polarity for sentiment classification, with a
focus on improving semantic accuracy. The research provides insights into the
methodologies and challenges associated with sentiment analysis on Twitter.
Training Procedure:
The system's training process involves feeding the preprocessed tweet data into
the Bi-directional LSTM Neural Network and adjusting the network's parameters using
backpropagation and gradient descent. Dropout regularization is applied during training
to prevent overfitting. Hyperparameters such as learning rate, batch size, and dropout
rate are optimized to enhance model performance.
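The two training ingredients named above, dropout regularization and a gradient-descent parameter update, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `dropout` and `sgd_step`, the example values, and the use of "inverted" dropout (rescaling the surviving units) are choices made here, not details taken from the report's model.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Inverted dropout (assumed variant): during training, zero each unit
    with probability `rate` and rescale the survivors by 1 / (1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def sgd_step(weights, gradients, learning_rate):
    """One plain gradient-descent update: w <- w - learning_rate * dL/dw."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

rng = random.Random(0)                 # fixed seed so the sketch is repeatable
hidden = [0.5, 1.0, 1.5, 2.0]          # hypothetical layer activations
dropped = dropout(hidden, rate=0.5, rng=rng)
updated = sgd_step([0.2, -0.1], [0.5, -1.0], learning_rate=0.1)
print(dropped, updated)
```

At inference time the same function is called with `training=False`, which returns the activations unchanged. The learning rate and dropout rate are exactly the kind of hyperparameters the paragraph above describes tuning.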
CHAPTER 3
SYSTEM STUDY
3.1 SCOPE:
The scope of Sentiment Analysis of Twitter Data: A Case Study is to analyze
the sentiments expressed by users on Twitter and classify the tweets according to their
sentiment (positive, negative, or neutral) using NLP techniques.
filled up with zero, and the resultant matrix will become sparse.
After feature extraction from the preprocessed data set, we passed the data to the
machine learning classifier. We used a bi-directional LSTM for this purpose, with 80%
of the data for training and 20% of the data for testing.
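The 80/20 split can be illustrated with a small standard-library sketch. The appendix code uses scikit-learn's `train_test_split` for this step; the function below is a hypothetical stand-in that only shows the idea of shuffling once and cutting the data at the 20% mark.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.2, seed=1):
    """Shuffle the indices 0..n_samples-1 and split them into
    (train_indices, test_indices) according to `test_fraction`."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = train_test_split_indices(10)
print(len(train_idx), len(test_idx))   # 8 2
```

The appendix code additionally holds out 25% of the training portion for validation, which yields a 60/20/20 train/validation/test split overall.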
3.3.2.1 ANACONDA
The Anaconda platform is used for machine learning and other large-scale data
processing. It is a free, open-source distribution that works with the R and Python
programming languages. It includes more than 1,500 packages and the conda package
and virtual environment manager. Anaconda Navigator is its graphical interface for
launching applications and managing the installed packages and environments.
3.3.2.2 SPYDER
3.3.2.4 TWEEPY
Tweepy is a Python library that simplifies the process of interacting with the
Twitter API, offering easy authentication handling, versatile API functionality, flexible
tweet handling, rich user interaction, real-time streaming capabilities, robust error
handling and rate limiting, comprehensive documentation, and strong community
support. It serves as an indispensable tool for developers seeking to integrate Twitter
functionality into their Python applications, enabling them to access Twitter data and
functionality efficiently and effectively.
CHAPTER 4
SYSTEM DESIGN
4.1 OVERVIEW
This chapter presents an overview of the whole system. Section 4.2 shows the overall
system architecture. Section 4.3 defines the two main modules used: Section 4.3.1
describes the Bidirectional LSTM and Section 4.3.2 describes the neural network.
4.3 MODULES
Neural Network
Bi-Directional LSTM
4.3.1 BIDIRECTIONAL LSTM
Figure 4.3 LSTM Sigmoid And Computation Formula
Figure 4.4 Neural Network Diagram
CHAPTER 5
IMPLEMENTATION METHODOLOGY
5.1 OVERVIEW:
The implementation methodology describes the main functional requirements
that are needed to carry out the project.
5.2.1 PANDAS:
Pandas is a Python package providing fast, flexible, and expressive data
structures designed to make working with “relational” or “labeled” data both easy and
intuitive. It aims to be the fundamental high-level building block for doing practical,
real-world data analysis in Python.
5.2.2 TEXTBLOB:
TextBlob is a Python (2 and 3) library for processing textual data. It provides a
simple API for diving into common natural language processing (NLP) tasks such as
part-of-speech tagging, noun phrase extraction, sentiment analysis, classification,
translation, and more.
5.2.3 NLTK:
NLTK stands for Natural Language Toolkit, a toolkit built for working with NLP
in Python. It provides various text processing libraries along with many test
datasets. A variety of tasks can be performed using NLTK, such as tokenizing and
parse tree visualization.
5.2.4 MATPLOTLIB:
Matplotlib is one of the most common packages used for data visualization in
Python. It is a cross-platform library for creating two-dimensional plots from data in
arrays.
MATPLOTLIB.PYPLOT:
matplotlib.pyplot is a collection of command-style functions that make
Matplotlib work like MATLAB. Each pyplot function makes some change to a figure:
e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting
area, or decorates the plot with labels. In this project, matplotlib.pyplot is used to
generate graphs, such as bar graphs and Cartesian graphs, that show the performance
of the algorithm against different parameters.
5.2.5 SNOWBALLSTEMMER:
The Snowball stemmer, also known as the Porter2 stemming algorithm, is an
improved version of the Porter stemmer that fixes some of the issues in the original
algorithm.
5.3.3 SUBJECTIVITY:
Subjectivity is part of the sentiment output provided by TextBlob. It is a value
that lies within [0, 1] and indicates how much of the text reflects personal opinions
and judgments rather than factual information.
5.3.4 POLARITY:
In this method, TextBlob assigns each tweet a polarity score that ranges from
-1 to +1. The polarity score is used to classify the tweets as positive, negative, or
neutral.
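One way to turn a polarity score into a label is sketched below. The report does not state the exact cut-offs, so the thresholds here (above zero is positive, below zero is negative, exactly zero is neutral) and the function name `label_from_polarity` are assumptions made for illustration. The label strings follow the category names used in the appendix code.

```python
def label_from_polarity(polarity, neutral_band=0.0):
    """Map a polarity score in [-1, 1] to a sentiment label.

    `neutral_band` is a hypothetical tolerance: scores whose absolute
    value does not exceed it are treated as Neutral.
    """
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1, 1]")
    if polarity > neutral_band:
        return "Positive"
    if polarity < -neutral_band:
        return "Negative"
    return "Neutral"

labels = [label_from_polarity(p) for p in (0.6, -0.3, 0.0)]
print(labels)   # ['Positive', 'Negative', 'Neutral']
```

Widening `neutral_band` (for example to 0.05) would move weakly scored tweets into the neutral class, which changes the class balance of the annotated dataset.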
5.3.5 TOKENIZATION:
Word tokenization splits a large sample of text into words. It is a requirement
in natural language processing tasks where each word needs to be captured and
subjected to further analysis, such as classifying and counting words for a
particular sentiment.
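A minimal word tokenizer can be written with the standard library alone. The sketch below lowercases the text and keeps runs of letters and digits, which mirrors the cleaning step in the appendix function `tweet_to_words`. The helper name `tokenize` and the sample sentence are illustrative only.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

tokens = tokenize("Loving the new phone, loving the camera!")
counts = Counter(tokens)
print(tokens)            # ['loving', 'the', 'new', 'phone', 'loving', 'the', 'camera']
print(counts["loving"])  # 2
```

Counting tokens per sentiment class, as in the last line, is the kind of follow-up analysis the paragraph above refers to.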
5.3.6 STEMMING:
Stemming is the process of reducing inflected words to their root forms,
mapping a group of words to the same stem even if the stem itself is not a valid word
in the language.
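The idea can be shown with a deliberately simple suffix-stripping function. This toy is not the Porter or Snowball algorithm (the appendix code uses NLTK's `PorterStemmer`); the suffix list and the name `toy_stem` are assumptions chosen only to show how several word forms collapse to one stem.

```python
SUFFIXES = ("ing", "ed", "es", "s")   # hypothetical, far from a complete rule set

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least `min_stem_len` letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

stems = [toy_stem(w) for w in ["connected", "connecting", "connects", "connect"]]
print(stems)                 # ['connect', 'connect', 'connect', 'connect']
print(toy_stem("studies"))   # studi
```

The last line shows a stem that is not a dictionary word, which is acceptable for stemming because the stem only needs to be consistent across related word forms.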
CHAPTER 6
6.1 PERFORMANCE METRICS
6.1.1 OVERVIEW:
Our algorithm is evaluated across four metrics: 1) confusion matrix, 2) F1-score,
3) precision, and 4) recall. The performance metrics are computed on the collected
tweets for the supervised model used in this study, the Bi-Directional Long
Short-Term Memory neural network.
• True Positives (TP) − It is the case when both actual class & predicted class of data
point is 1.
• True Negatives (TN) − It is the case when both actual class & predicted class of data
point is 0.
• False Positives (FP) − It is the case when actual class of data point is 0 & predicted
class of data point is 1.
• False Negatives (FN) − It is the case when actual class of data point is 1 & predicted
class of data point is 0.
6.1.3 F1-SCORE:
The F1 score is the harmonic mean of precision and recall. The best value of F1
is 1 and the worst is 0. We can calculate the F1 score with the help of the following
formula –
F1 = 2 * (Precision * Recall) / (Precision + Recall)
6.1.4 PRECISION:
Precision, used in document retrieval, may be defined as the fraction of the
documents returned by our ML model that are correct. We can easily calculate it from
the confusion matrix with the help of the following formula –
Precision = TP / (TP + FP)
Where,
TP - True Positive
FP - False Positive
6.1.5 RECALL:
Recall may be defined as the fraction of the actual positives that are returned by
our ML model. We can easily calculate it from the confusion matrix with the help of
the following formula −
Recall = TP / (TP + FN)
Where,
TP - True Positive
FN - False Negative
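The confusion-matrix counts and the three derived metrics can be computed directly from lists of actual and predicted labels. The sketch below is for the binary case with hypothetical labels; the function names and the choice to return 0.0 when a denominator is zero are assumptions made here, not part of the report's evaluation code.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, TN, FP, FN) for binary labels."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn

def precision_recall_f1(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = harmonic mean of both."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

actual = [1, 1, 1, 0, 0, 1, 0, 0]       # hypothetical ground truth
predicted = [1, 0, 1, 0, 1, 1, 0, 0]    # hypothetical model output
tp, tn, fp, fn = confusion_counts(actual, predicted)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, tn, fp, fn)          # 3 3 1 1
print(precision, recall, f1)   # 0.75 0.75 0.75
```

For the three-class setting used in this project (positive, negative, neutral), the same counts would be computed once per class, treating that class as "positive" and the other two as "negative".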
6.2.1 OVERVIEW
This chapter explains the results of our project. Screenshots for each
step are included and explained.
6.2.2 DATASETS:
twitter-and-reddit-sentimental-analysis-dataset    2    4     8,000
appletwittersentimenttexts                         2    8     5,787
twitterdata                                        2    13    10,587
twitter-sentiment-dataset                          2    7     6,347
6.2.3 SCREENSHOTS
Figure 6.4 Data Preprocessing Of twitter-data
Figure 6.6 Data visualization is the graphical representation of data to facilitate
understanding, analysis, and communication of insights.
Figure 6.7 Data Visualization of Mean, Median, Mode Values for Twitter Data (Tweets)
Figure 6.8 Distribution of text attributes for biclass of the data
Figure 6.12 Word Count
Figure 6.14 TensorFlow Model Diagram
Figure 6.15 Confusion Matrix of True Labels for Bi-Directional LSTM(NN)
Figure 6.16 Model Accuracy And Model Loss Graph
CHAPTER 7
CONCLUSION
7.1 CONCLUSION
Social media is witnessing a massive increase in the number of users per day.
People often prefer to share their honest opinions on social media instead of sharing
them with someone in person. Using posts from Twitter, we examined the general
public's aggregate reaction toward various topics. Motivated by the mixed reactions
that follow certain events on Twitter, we collected tweets during specific time periods.
We preprocessed the collected data, annotated it with TextBlob, and classified it with
a bi-directional LSTM. We observed the best performance with the bi-directional
LSTM and unigrams. This combination gives an accuracy of 95.4%, the best among
all the combinations that we executed on our data set. We consolidated the
performance by calculating precision, recall, F1-score, and tenfold cross-validation
for all the combinations, and the best results again came from the bi-directional LSTM
with unigrams. So, we executed the sentiment analysis of tweets posted by the public
during specific events using this combination and found that the largest group (42.8%)
talks positively about the topic, 33.1% are neutral, and 24.1% express negative
feelings.
APPENDIX
CODE:
!pip install opendatasets
import numpy as np # linear algebra
import pandas as pd # data processing
import os
import tweepy as tw #for accessing Twitter API
import opendatasets as od
#For Preprocessing
import re # RegEx for removing non-letter characters
import nltk #natural language processing
nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.stem.porter import *
pd.options.plotting.backend = "plotly"
od.download("https://www.kaggle.com/datasets/cosmos98/twitter-and-reddit-sentimental-analysis-dataset")
od.download("https://www.kaggle.com/datasets/seriousran/appletwittersentimenttexts")
od.download("https://www.kaggle.com/datasets/surajkum1198/twitterdata")
od.download("https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment")
od.download("https://www.kaggle.com/datasets/saurabhshahane/twitter-sentiment-dataset")
df2.head()
# Output first five rows
df3.head()
import matplotlib.pyplot as plt  # plotting
import seaborn as sns  # statistical plots
df.groupby('category').count().plot(kind='bar')
fig = plt.figure(figsize=(14,7))
df['length'] = df.clean_text.str.split().apply(len)
ax1 = fig.add_subplot(122)
sns.histplot(df[df['category']=='Positive']['length'], ax=ax1,color='green')
describe = df.length[df.category=='Positive'].describe().to_frame().round(2)
ax2 = fig.add_subplot(121)
ax2.axis('off')
font_size = 14
bbox = [0, 0, 1, 1]
table = ax2.table(cellText = describe.values, rowLabels = describe.index, bbox=bbox,
colLabels=describe.columns)
table.set_fontsize(font_size)
fig.suptitle('Distribution of text length for positive sentiment tweets.', fontsize=16)
plt.show()
fig = plt.figure(figsize=(14,7))
df['length'] = df.clean_text.str.split().apply(len)
ax1 = fig.add_subplot(122)
sns.histplot(df[df['category']=='Negative']['length'], ax=ax1,color='red')
describe = df.length[df.category=='Negative'].describe().to_frame().round(2)
ax2 = fig.add_subplot(121)
ax2.axis('off')
font_size = 14
bbox = [0, 0, 1, 1]
table = ax2.table(cellText = describe.values, rowLabels = describe.index, bbox=bbox,
colLabels=describe.columns)
table.set_fontsize(font_size)
fig.suptitle('Distribution of text length for Negative sentiment tweets.', fontsize=16)
plt.show()
import plotly.express as px
fig = px.pie(df, names='category', title ='Pie chart of different sentiments of tweets')
fig.show()
- df: tweets dataset
- category: Positive/Negative/Neutral
'''
# Combine all tweets
combined_tweets = " ".join([tweet for tweet in
df[df.category==category]['clean_text']])
def tweet_to_words(tweet):
    ''' Convert tweet text into a sequence of words '''
    # convert to lowercase
    text = tweet.lower()
    # replace non-alphanumeric characters with spaces
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    # tokenize
    words = text.split()
    # remove stopwords (build the set once instead of once per word)
    stop_words = set(stopwords.words("english"))
    words = [w for w in words if w not in stop_words]
    # apply stemming
    stemmer = PorterStemmer()
    words = [stemmer.stem(w) for w in words]
    # return list
    return words
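from sklearn.model_selection import train_test_split
# One-hot encode the sentiment labels
y = pd.get_dummies(df['category'])
# 20% test, then 25% of the remainder for validation (60/20/20 overall)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25,
                                                  random_state=1)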
from sklearn.feature_extraction.text import CountVectorizer
vocabulary_size = 5000
# Tweets have already been preprocessed, hence an identity function is passed in
# to the preprocessor & tokenizer steps
count_vector = CountVectorizer(max_features=vocabulary_size,
                               # ngram_range=(1,2), # unigram and bigram
                               preprocessor=lambda x: x,
                               tokenizer=lambda x: x)
#tfidf_vector = TfidfVectorizer(lowercase=True, stop_words='english')
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
max_words = 5000
max_len=50
def tokenize_pad_sequences(text):
    '''
    This function tokenizes the input text into sequences of integers and then
    pads each sequence to the same length
    '''
    # Text tokenization
    tokenizer = Tokenizer(num_words=max_words, lower=True, split=' ')
    tokenizer.fit_on_texts(text)
    # Transforms text to a sequence of integers
    X = tokenizer.texts_to_sequences(text)
    # Pad sequences to the same length
    X = pad_sequences(X, padding='post', maxlen=max_len)
    # return sequences
    return X, tokenizer
import pickle
# saving
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
# loading
with open('tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)
y = pd.get_dummies(df['category'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25,
random_state=1)
print('Train Set ->', X_train.shape, y_train.shape)
print('Validation Set ->', X_val.shape, y_val.shape)
print('Test Set ->', X_test.shape, y_test.shape)
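import keras.backend as K
# NOTE: the enclosing def line is missing from this listing; the name f1_score is assumed
def f1_score(precision, recall):
    ''' Compute the F1 score from precision and recall '''
    f1_val = 2*(precision*recall)/(precision+recall+K.epsilon())
    return f1_val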
from keras.callbacks import LearningRateScheduler
from keras.callbacks import History
vocab_size = 5000
embedding_size = 32
epochs=20
learning_rate = 0.1
decay_rate = learning_rate / epochs
momentum = 0.8
import tensorflow as tf
tf.keras.utils.plot_model(model, show_shapes=True)
print(model.summary())
from keras.metrics import Precision, Recall
# Compile model (the optimizer object sgd must be defined before this point)
model.compile(loss='categorical_crossentropy', optimizer=sgd,
              metrics=['accuracy', Precision(), Recall()])
# Train model
batch_size = 64
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    batch_size=batch_size, epochs=epochs, verbose=1)
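def plot_training_hist(history):
    '''Function to plot history for accuracy and loss'''
    # NOTE: figure setup and the first plot are reconstructed by symmetry with
    # the second plot (assumed; these lines are missing from the listing)
    fig, ax = plt.subplots(1, 2)
    # first plot
    ax[0].plot(history.history['accuracy'])
    ax[0].plot(history.history['val_accuracy'])
    ax[0].set_title('Model Accuracy')
    ax[0].set_xlabel('epoch')
    ax[0].set_ylabel('accuracy')
    ax[0].legend(['train', 'validation'], loc='best')
    # second plot
    ax[1].plot(history.history['loss'])
    ax[1].plot(history.history['val_loss'])
    ax[1].set_title('Model Loss')
    ax[1].set_xlabel('epoch')
    ax[1].set_ylabel('loss')
    ax[1].legend(['train', 'validation'], loc='best')
plot_training_hist(history)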
plt.xlabel('Actual label', fontsize=12)
plt.ylabel('Predicted label', fontsize=12)
from keras.models import load_model
# Load model
model = load_model('best_model.h5')
def predict_class(text):
    '''Function to predict sentiment class of the passed text'''