
TEXT-IMAGE EMOTION SYNTHESIS: A
MULTIMODAL PERSPECTIVE

A PROJECT REPORT

Submitted by

ARUN KUMAR T 312321104020


GOPINATH V 312321104048

in partial fulfilment of the requirements for the award of the degree of

BACHELOR OF ENGINEERING
in
COMPUTER SCIENCE AND ENGINEERING

St. JOSEPH’S COLLEGE OF ENGINEERING


(An Autonomous Institution)
OMR, Chennai 600 119

ANNA UNIVERSITY :: CHENNAI 600 025


APRIL 2024

ANNA UNIVERSITY, CHENNAI

BONAFIDE CERTIFICATE

Certified that this project report “Perform Sentiment Analysis On Twitter
Data” is the bonafide work of ARUN KUMAR T (312321104020) and
GOPINATH V (312321104048), who carried out the work under my guidance.
Certified further that, to the best of my knowledge, the work reported herein does
not form part of any other thesis or dissertation on the basis of which a degree or
award was conferred on an earlier occasion on this or any other candidate.

SIGNATURE SIGNATURE

HEAD OF THE DEPARTMENT SUPERVISOR


Dr. V. Muthulakshmi, M.E., Ph.D., Ms. D. Saranya, M.E.,
Professor & Head of Department Assistant Professor,
Dept. of Computer Science and Engineering, Dept. of Computer Science and Engineering,
St. Joseph’s College of Engineering, St. Joseph’s College of Engineering,
OMR, Chennai - 600 119 OMR, Chennai - 600 119

Submitted for the Project and Viva Voce Examination held on __________________

INTERNAL EXAMINER EXTERNAL EXAMINER

ACKNOWLEDGEMENT
At the outset, we would like to express our sincere gratitude to our beloved Dr.
B. Babu Manoharan M.A., M.B.A., Ph.D., Chairman, St. Joseph’s Group of
Institutions, for his constant guidance and support to the student community and
society.

We would like to express our hearty thanks to our respected Managing


Director Mr. B. Shashi Sekar, M.Sc. for his kind encouragement and blessings.

We wish to express our sincere thanks to the Executive Director


Mrs. S. Jessie Priya M.Com. for providing ample facilities in the institution.
We express sincere gratitude to our beloved Principal
Dr. Vaddi Seshagiri Rao M.E., M.B.A., Ph.D., F.I.E. for his inspirational ideas
during the course of the project.
We express our sincere gratitude to our beloved Dean
(Research) Dr. A. Chandrasekar M.E., Ph.D., Dean (Student Affairs)
Dr. V. Vallinayagam M.Sc., M.Phil., Ph.D., and Dean (Academics)
Dr. G. Sreekumar M.Sc., M.Tech., Ph.D., for their inspirational ideas during the
course of the project.

We wish to express our sincere thanks to Dr. V Muthulakshmi, M.E., Ph.D.,


Head of the Department, Department of Computer Science and Engineering, St.
Joseph’s College of Engineering for her guidance and assistance in solving the
various intricacies involved in the project.

We would like to acknowledge our profound gratitude to our supervisor Ms.


D. Saranya, M.E., for her expert guidance and insightful suggestions, which
helped us carry out the study successfully.
Finally, we thank the Faculty Members and our Family, who helped and
encouraged us constantly to complete the project successfully.

ABSTRACT

With the exponential growth of internet usage, sentiment analysis has emerged as
a pivotal domain within natural language processing (NLP). Leveraging sentiment
analysis, one can effectively mine the implicit emotions embedded in textual content
across various contexts. Given the extensive utilization of social media platforms,
where users exchange vast amounts of information, mining such data to gauge
sentiments becomes instrumental in understanding public opinion. Thus, the focal point
of this study is to delve into the sentiment analysis of user-generated content on Twitter.

To conduct this research, a dataset comprising 13,000 tweets was curated from Kaggle.
Subsequently, employing the natural language toolkit in Python, the collected data
underwent preprocessing. Annotation was facilitated through the utilization of
TextBlob. Through rigorous experimentation, employing a machine learning algorithm,
the study attained a commendable accuracy rate of 95.4%. The Bidirectional Long
Short-Term Memory (BiLSTM) neural network, coupled with unigrams, emerged as the
most effective approach in sentiment analysis for this study.

The findings of this research point towards a notable trend: the majority of users tend
to express sentiments that align with the prevalent topics being discussed on Twitter.
This insight underscores the influence of contextual factors on user sentiment
expression within social media discourse. By comprehensively analyzing user
sentiments, particularly within the Twitter ecosystem, researchers gain valuable insights
into public sentiment dynamics, which can inform various domains such as marketing,
politics, and public opinion monitoring.

TABLE OF CONTENTS
CHAPTER TITLE PAGE NO.
NO.
ABSTRACT iv
LIST OF FIGURES vii
LIST OF TABLES ix
LIST OF ABBREVIATIONS x
LIST OF SYMBOLS xi
1 INTRODUCTION 1
1.1 SENTIMENT ANALYSIS 1
1.2 NLP 1
1.2.1 TEXTBLOB 1
1.3 STEPS OF NLP 1
1.4 TYPES OF NLP 2
1.5 HOW DOES NLP WORK? 3
2 LITERATURE SURVEY 4
2.1 EXISTING SYSTEM 4
2.2 RELATED WORKS 5
2.3 PROPOSED SYSTEM 6
3 SYSTEM STUDY 8
3.1 SCOPE 8
3.2 PRODUCT FUNCTION 8
3.3 SYSTEM REQUIREMENTS 9
3.3.1 HARDWARE INTERFACES 9
3.3.2 SOFTWARE INTERFACES 9
3.3.2.1 ANACONDA 9
3.3.2.2 SPYDER 9
3.3.2.3 PYTHON 10
3.3.2.4 TWEEPY 10
3.3.2.5 GOOGLE COLAB 10
4 SYSTEM DESIGN 11
4.1 OVERVIEW 11
4.2 OVERALL ARCHITECTURE 11
4.3 MODULES 11
4.3.1 BIDIRECTIONAL LONG SHORT-TERM MEMORY 12
4.3.2 NEURAL NETWORK 13

5 SYSTEM IMPLEMENTATION 15
5.1 OVERVIEW 15
5.2 ESSENTIAL LIBRARIES 15
5.2.1 PANDAS 15
5.2.2 TEXTBLOB 15
5.2.3 NLTK 15
5.2.4 MATPLOTLIB 15
5.2.5 SNOWBALLSTEMMER 16
5.3 FUNCTIONS USED FOR IMPLEMENTATION 16
5.3.1 DATA EXTRACT 16
5.3.2 CLEAN TWEET 16
5.3.3 SUBJECTIVITY 16
5.3.4 POLARITY 17
5.3.5 TOKENIZATION 17
5.3.6 STEMMING 17
6 RESULTS AND EVALUATION 18
6.1 PERFORMANCE METRICS 18
6.1.1 OVERVIEW 18
6.1.2 CONFUSION MATRIX 18
6.1.3 F1-SCORE 19
6.1.4 PRECISION 19
6.1.5 RECALL 19
6.2 RESULTS AND DISCUSSION 20
6.2.1 OVERVIEW 20
6.2.2 DATASET 20
6.2.2.1 STATIONARY DATASET 20
6.2.3 SCREENSHOTS 21
7 CONCLUSION AND FUTURE ENHANCEMENT 30
7.1 CONCLUSION 30
7.2 FUTURE ENHANCEMENT 30
8 APPENDICES 31
9 REFERENCES 45

LIST OF FIGURES

FIGURE NO. FIGURE NAME PAGE NO.

1.1 Steps of NLP 2

2.1 LSTM Calculation Formula 6

2.2 Flow Diagram for Bi-Directional LSTM 7

4.1 System architecture 11

4.2 Workflow of Bidirectional LSTM 12

4.3 LSTM Sigmoid And Computational formula 13

4.4 Neural Network Diagram 14

4.5 Neural Network Work Flow Diagram 14

6.1 Confusion Matrix 18

6.2 Stationary Dataset 20

6.3 Data Preprocessing Of twitter-airline-sentiment 21

6.4 Data Preprocessing Of twitter-and-reddit-sentimental-analysis-dataset 21
6.5 Data Preprocessing Of twitter-data 22

6.6 Data Preprocessing Of appletwittersentimenttexts. 22

6.7 Data Visualization 23

6.8 Data Visualization of Mean, Median, Mode Values 23

6.9 Distribution of text length for positive sentiment tweets 24

6.10 Distribution of text length for negative sentiment tweets 24
6.11 Pie Chart Of Different Sentiments of tweets 25

6.12 Word Cloud 25

6.13 Word Count 26

6.14 Model Summary 26

6.15 TensorFlow Model Diagram 27

6.16 Confusion Matrix for True Label vs Predicted Label for Bidirectional LSTM 28
6.17 Model Accuracy And Model Loss Graph 29

6.18 Predicted Outputs 29

LIST OF TABLES

TABLE NO. TABLE NAME PAGE NO.

6.1 Performance of Stationary dataset 20

LIST OF ABBREVIATIONS

LSTM Long Short-Term Memory

FP False Positive

TP True Positive

NLP Natural Language Processing

FN False Negative

NN Neural Network

LIST OF SYMBOLS
NOTATION MEANING
X Dataset
𝑤𝑖 Weights
P Precision
F F1-Score
L Labeling
R Recall

CHAPTER 1

INTRODUCTION

1.1 SENTIMENT ANALYSIS

Sentiment analysis, also referred to as opinion mining, is an approach to natural


language processing (NLP) that identifies the emotional tone behind a body of text. This
is a popular way for organizations to determine and categorize opinions about a product,
service, or idea. Sentiment analysis focuses on the polarity of a text (positive, negative,
neutral) but it also goes beyond polarity to detect specific feelings and emotions (angry,
happy, sad, etc.) and even intentions. Many emotion detection systems use lexicons (i.e.,
lists of words and the emotions they convey) or complex machine learning algorithms.

1.2 NLP

Natural language processing (NLP) refers to the branch of computer science—


and more specifically, the branch of artificial intelligence or AI—concerned with giving
computers the ability to understand text and spoken words in much the same way human
beings can.

1.2.1 TEXTBLOB

TextBlob is a lexicon-based sentiment analyzer. It relies on predefined rules
and word scores that combine to calculate a sentence's polarity, which is why
lexicon-based sentiment analyzers are also called “rule-based sentiment analyzers”.
TextBlob is a Python library for processing textual data. It provides a simple API for
diving into common natural language processing (NLP) tasks such as part-of-speech
tagging, noun phrase extraction, sentiment analysis, classification, translation, and
more.
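For illustration, a minimal sketch of TextBlob's sentiment API is shown below; the
example sentence is arbitrary.

from textblob import TextBlob

# Polarity lies in [-1.0, 1.0]; subjectivity lies in [0.0, 1.0].
blob = TextBlob("The new update is great, but the app still crashes sometimes.")
print(blob.sentiment.polarity)      # > 0 suggests positive sentiment
print(blob.sentiment.subjectivity)  # closer to 1.0 for opinionated text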

1.3 STEPS OF NLP

The steps of NLP include:


• Tokenization

• Lemmatization
• Stemming
• POS Tagging
• NER

Figure 1.1 Steps of NLP

1.4 TYPES OF NLP


There are many different natural language processing algorithms, but two main
types are commonly used:

Rule-based system:
This system uses carefully designed linguistic rules. This approach was used
early on in the development of natural language processing, and is still used.

Machine learning-based system:


Machine learning algorithms use statistical methods. They learn to perform tasks
based on training data they are fed, and adjust their methods as more data is processed.

Using a combination of machine learning, deep learning and neural networks, natural
language processing algorithms hone their own rules through repeated processing and
learning.

1.5 HOW DOES NLP WORK?


Natural Language Processing (NLP) stands at the intersection of linguistics,
computer science, and artificial intelligence, striving to bridge the gap between human
communication and machine understanding. At its core, NLP aims to equip computers
with the ability to comprehend and interpret human language in all its nuances,
including spoken and written forms. This involves breaking down complex linguistic
structures, deciphering meaning from context, and generating appropriate responses or
actions. Through a combination of computational algorithms, machine learning
techniques, and linguistic theories, NLP systems analyze language patterns, extract
relevant information, and make informed decisions, mirroring human cognitive
processes to a certain extent.

The journey of NLP begins with input acquisition, where raw text or speech data is
collected from various sources. This data undergoes preprocessing, a crucial step that
involves cleaning, formatting, and organizing the input to prepare it for further analysis.
Techniques like tokenization, where text is segmented into meaningful units like words
or sentences, and normalization, which standardizes text to a consistent format, help
streamline the processing pipeline. Following preprocessing, the language
understanding phase takes center stage, where NLP models delve into syntactic,
semantic, and discourse analysis to grasp the underlying meaning of the text. Syntax
analysis parses the grammatical structure, semantic analysis deciphers word meanings
and relationships, while discourse analysis considers the broader context to ensure
coherence and relevance.
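As a small illustration of the tokenization and normalization steps described
above, the sketch below uses NLTK (assuming the 'punkt' tokenizer models are
available for download); the input string is arbitrary.

import nltk
nltk.download('punkt', quiet=True)  # tokenizer models used by NLTK
from nltk.tokenize import sent_tokenize, word_tokenize

raw = "NLP bridges human language and machines. It's everywhere!"
sentences = sent_tokenize(raw)                    # sentence segmentation
tokens = [w.lower() for w in word_tokenize(raw)]  # tokenization + case normalization
print(sentences)
print(tokens)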

Feature extraction serves as a bridge between linguistic understanding and machine
learning, where relevant linguistic features are transformed into numerical
representations that machine learning models can operate on.
CHAPTER 2
LITERATURE SURVEY

2.1 EXISTING SYSTEM

The existing system involves the analysis of sentiments expressed on Twitter, a


widely used microblogging platform with over 330 million users worldwide. Sentiment
analysis, also known as opinion mining, is conducted to determine the polarity or type
of opinion conveyed in tweets. Natural Language Processing (NLP) techniques such as
tokenization, elimination of stop words, and stemming are utilized to support the
analysis process. The system focuses on developing sentiment analysis using lexicons
and multiplication polarity, although the accuracy results are reported to be lower
compared to machine learning approaches. The study acknowledges the need for
improvements in lexicon semantics to enhance accuracy.

In recent years, the use of social networking sites has significantly increased, generating
vast amounts of data as users express their views and opinions. This paper discusses
sentiment extraction from Twitter, where users post their opinions and views on various
topics. Sentiment analysis is conducted on tweets to provide insights for business
intelligence purposes. The system employs the Hadoop Framework to process a movie
dataset gathered from Twitter, including reviews, feedback, and comments. Results of
sentiment analysis are presented categorically, highlighting positive, negative, and
neutral sentiments. The paper emphasizes the importance of sentiment analysis in
understanding public opinions, especially on social media platforms where individuals
frequently seek reviews and opinions on various subjects. Natural Language Processing
(NLP) techniques are utilized to analyze tweets and determine people's thoughts on
specific topics.

The analysis of sentiments helps in discerning whether the sentiment expressed in text
is positive, negative, or neutral, providing valuable insights into societal sentiments.

2.2 RELATED WORKS
2.2.1 Summary of Related Works:
Sentiment Analysis on Twitter Data:
This study explores sentiment analysis techniques applied to Twitter data,
recognizing the significance of understanding public opinions on social media platforms.
It utilizes Natural Language Processing (NLP) techniques such as tokenization,
elimination of stop words, and stemming to analyze sentiments expressed in tweets. The
system employs lexicons and multiplication polarity for sentiment classification, with a
focus on improving semantic accuracy. The research provides insights into the
methodologies and challenges associated with sentiment analysis on Twitter.

Hadoop Framework for Twitter Data Processing:


In this work, the emphasis lies on leveraging the Hadoop Framework for
processing Twitter data, particularly for sentiment analysis purposes. With the
exponential growth of social media data, efficient processing methods are essential. The
study underscores the need for scalable and distributed processing solutions to handle
large volumes of Twitter data effectively. By utilizing the Hadoop Framework, the
system aims to extract insights from Twitter data to inform business intelligence
decisions.

Comparison of Machine Learning Methods for Sentiment Analysis:


This research compares various machine learning algorithms for sentiment
analysis on Twitter data. Acknowledging the importance of understanding societal
sentiments expressed on social media platforms like Twitter, the study evaluates the
performance of classification algorithms, including Naïve Bayes, Support Vector
Machine, and Maximum Entropy. The research aims to identify the most effective
approach for sentiment analysis on Twitter by assessing the accuracy and precision of
each method, providing insights into the strengths and limitations of different machine
learning techniques.
Deep Learning Approaches for Sentiment Analysis:
Another related work delves into deep learning approaches for sentiment
analysis on Twitter data. Recognizing the complexity of natural language
understanding, the study explores the application of deep learning architectures such as
recurrent neural networks (RNNs) and convolutional neural networks (CNNs) for
sentiment classification. By leveraging the hierarchical representation learning
capabilities of deep learning models, the system aims to capture nuanced sentiments
expressed in tweets and improve sentiment analysis accuracy.

Social Network Analysis for Sentiment Mining:


This study investigates social network analysis techniques for sentiment mining on
Twitter. Beyond analyzing individual tweets, the research focuses on understanding the
collective sentiment dynamics within social networks. By examining user interactions,
network structures, and influence patterns, the study aims to uncover underlying
sentiment trends and group sentiments. The research highlights the importance of
considering the social context in sentiment analysis and provides insights into
leveraging network-based approaches for deeper sentiment understanding.

2.3 PROPOSED SYSTEM


In addressing the imperative for precise sentiment analysis on Twitter data, this
proposed system introduces a Bi-directional LSTM (Long Short-Term Memory) Neural
Network. Unlike conventional LSTM networks, Bi-directional LSTM networks possess
the capability to capture context from both past and future inputs, enabling a more
comprehensive understanding of sequential data like tweets.

Bi-directional LSTM Architecture:


The system's architecture involves encoding tweet text using word embeddings
and passing it through both forward and backward LSTM layers. This ensures that the
network captures dependencies in both directions, enhancing its ability to grasp the
context of each tweet effectively. The forward LSTM calculates hidden states based on
past inputs, while the backward LSTM calculates hidden states based on future inputs.

Backward LSTM Work Flow:

The backward LSTM mirrors the forward LSTM but processes the sequence in
reverse order, so its hidden states summarize future context.

Figure 2.2 Flow Diagram For Bi-directional LSTM

Training Procedure:
The system's training process involves feeding the preprocessed tweet data into
the Bi-directional LSTM Neural Network and adjusting the network's parameters using
backpropagation and gradient descent. Dropout regularization is applied during training
to prevent overfitting. Hyperparameters such as learning rate, batch size, and dropout
rate are optimized to enhance model performance.
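A condensed sketch of this setup is shown below. The layer sizes and
hyperparameters mirror the appendix code and are illustrative values, not tuned
settings.

from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense

vocab_size, embedding_dim, max_len = 5000, 32, 50

model = Sequential([
    Embedding(vocab_size, embedding_dim, input_length=max_len),
    Bidirectional(LSTM(32)),        # forward and backward passes, concatenated
    Dropout(0.4),                   # dropout regularization against overfitting
    Dense(3, activation='softmax')  # Negative / Neutral / Positive
])
model.compile(loss='categorical_crossentropy', optimizer='sgd',
              metrics=['accuracy'])
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=20, batch_size=64)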

CHAPTER 3
SYSTEM STUDY

3.1 SCOPE:
The scope of Sentiment Analysis of Twitter Data: A Case Study is to analyze
the sentiments expressed by users on Twitter and classify the tweets according to
their sentiments (positive, negative, or neutral) using NLP techniques.

3.2 PRODUCT FUNCTION:


 We have used the Tweepy open-source package for extracting data from Twitter.
Approximately 13,000 tweets have been scraped using the Tweepy package.
 We have labelled the data using TextBlob and VADER, which classify the tweets
as Positive, Negative, or Neutral.
 Once we had labelled the tweets using TextBlob and VADER, we plotted a graph
and counted how many Positive, Negative, or Neutral tweets were present.
 After finishing the data labelling, the data we collected may still hold some
unwanted, sentiment-free words such as links, Twitter-specific tokens like
hashtags (starting with #) and tags (starting with @), single-letter words,
numbers, etc. Such words act as noise in classifier training and testing. To
improve classifier efficiency, it is necessary to remove this noise from the
labelled data set before feeding it to the classifier.
 Our pre-processing module separates this noise from the labelled data set.
 In this step, we implemented a module to remove the above-specified impurities,
converted the data set into a data frame, and then removed string punctuation,
performed tokenization, removed English stop words, and applied stemming and
lemmatization.
 The matrix is not sparse when we convert only a single document.
 In the case of multiple documents, it is common that a word present in one
document is missing from some other documents; the corresponding cells are then
filled with zero, and the resultant matrix becomes sparse.

 After feature extraction of the preprocessed data set, we passed the data to
machine learning classifiers.
 We have used a bi-directional LSTM for this purpose.
 We have used 80% of the data for training and 20% for testing the classifiers
(see the sketch below).
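A minimal sketch of such an 80/20 split using scikit-learn; the placeholder
arrays stand in for the vectorized tweets and their labels.

from sklearn.model_selection import train_test_split

# Placeholder data standing in for the vectorized tweets and sentiment labels.
X = [[0, 1], [1, 0], [1, 1], [0, 0], [1, 2]]
y = [1, 0, 1, 0, 1]

# Hold out 20% of the data for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)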

3.3 SYSTEM REQUIREMENTS

The software and hardware requirements of the system are as follows:

3.3.1 HARDWARE INTERFACES

 Intel® Core™ i5-8265U @ 1.60 GHz


 8 GB RAM

3.3.2 SOFTWARE INTERFACES


 Platform – Anaconda
 IDE – Spyder
 Technologies used – Python
 API – snscrape
 Google Colab

3.3.2.1 ANACONDA
The Anaconda platform is used for machine learning and other large-scale data
processing. It is a free, open-source distribution that works with the R and
Python programming languages. It consists of more than 1,500 packages and a
virtual environment manager named Anaconda Navigator, which bundles all the
libraries to be installed.

3.3.2.2 SPYDER

To implement the proposed system, the Spyder IDE is used. It is an open-source,
cross-platform integrated development environment that combines advanced
features such as debugging, editing, and analysis of large data sets, and
supports interactive execution, data exploration, and visualization.
3.3.2.3 PYTHON

Python is an interpreted, general-purpose programming language with high-level
data structures. It can be used for creating server-side web applications and is
also suitable as an extension language for customizable applications.

3.3.2.4 TWEEPY
Tweepy is a Python library that simplifies the process of interacting with the
Twitter API, offering easy authentication handling, versatile API functionality, flexible
tweet handling, rich user interaction, real-time streaming capabilities, robust error
handling and rate limiting, comprehensive documentation, and strong community
support. It serves as an indispensable tool for developers seeking to integrate Twitter
functionality into their Python applications, enabling them to access Twitter data and
functionality efficiently and effectively.
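A minimal sketch of fetching recent tweets with Tweepy's v2 Client is shown
below; the bearer token is a placeholder (a real one comes from the Twitter
developer portal) and the query string is arbitrary.

import tweepy

# "YOUR_BEARER_TOKEN" is a placeholder, not a working credential.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# Fetch up to 10 recent English tweets matching a query, excluding retweets.
response = client.search_recent_tweets(query="sentiment analysis lang:en -is:retweet",
                                       max_results=10)
for tweet in response.data or []:
    print(tweet.text)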

3.3.2.5 GOOGLE COLAB


Colaboratory, or “Colab” for short, is a product from Google Research. Colab
allows anybody to write and execute arbitrary Python code through the browser, and is
especially well suited to machine learning, data analysis and education. Colab is a free
Jupyter notebook environment that runs entirely in the cloud. Most importantly, it does
not require a setup and the notebooks that you create can be simultaneously edited by
your team members - just the way you edit documents in Google Docs.

CHAPTER 4
SYSTEM DESIGN

4.1 OVERVIEW
This section presents the overview of the whole system. The Section 4.2 shows
the system Section 4.2 defines the main three modules used Section 4.3.1 defines how
clusters are formed with the unbounded data streams Section 4.3.2 defines the merge
operation with the previously formed rough clusters Section 4.3.3 describes how the
clusters are categorized and stored offline.

4.2 OVERALL ARCHITECTURE:

Figure 4.1 System architecture

4.3 MODULES

 Neural Network
 Bi-Directional LSTM

4.3.1 BIDIRECTIONAL LSTM

Bidirectional LSTM (Long Short-Term Memory) is a sophisticated type of


recurrent neural network architecture designed to process sequential data in both
forward and backward directions. This bidirectional processing capability distinguishes
it from traditional LSTMs, which only consider past information. By incorporating
future context as well, bidirectional LSTMs offer a comprehensive understanding of the
entire sequence, making them highly effective for tasks where capturing dependencies
across the entire sequence is crucial.

Figure 4.2 Bidirectional LSTM Flow Diagram

Figure 4.3 LSTM Sigmoid And Computation Formula

4.3.2 NEURAL NETWORK


A neural network is a powerful machine learning model inspired by the structure
and function of the human brain. It consists of interconnected nodes organized in layers,
each performing computations on the input data. One of the classic types of neural
networks is the feedforward neural network, commonly used for various tasks including
classification, regression, and pattern recognition.

The architecture of a feedforward neural network typically includes an input


layer, one or more hidden layers, and an output layer. Each layer is composed of
neurons, also known as nodes, which receive input signals, apply a transformation
function, and pass the result to the next layer.
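For illustration, a minimal feedforward network of this shape can be sketched in
Keras as follows; the layer sizes and the 8-feature input are arbitrary choices,
not the project's configuration.

from keras.models import Sequential
from keras.layers import Dense

model = Sequential([
    Dense(16, activation='relu', input_shape=(8,)),  # hidden layer over 8 input features
    Dense(3, activation='softmax')                   # 3-class output layer
])
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
model.summary()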

Figure 4.4 Neural Network Diagram

Figure 4.5 Neural Network Work Flow Diagram

CHAPTER 5
SYSTEM IMPLEMENTATION

5.1 OVERVIEW:
The implementation methodology describes the main functional components needed
for carrying out the project.

5.2 ESSENTIAL LIBRARIES:


The libraries used in this project are pandas, Tweepy, TextBlob, NLTK,
scikit-learn's CountVectorizer, Matplotlib, and SnowballStemmer.

5.2.1 PANDAS:
Pandas is a Python package providing fast, flexible, and expressive data
structures designed to make working with “relational” or “labeled” data both easy and
intuitive. It aims to be the fundamental high-level building block for doing practical,
real-world data analysis in Python.

5.2.2 TEXTBLOB:
TextBlob is a Python (2 and 3) library for processing textual data. It provides a
simple API for diving into common natural language processing (NLP) tasks such as
part-of-speech tagging, noun phrase extraction, sentiment analysis, classification,
translation, and more.
5.2.3 NLTK:
NLTK stands for Natural Language Toolkit, a toolkit built for working with NLP
in Python. It provides various text-processing libraries along with many test
datasets. A variety of tasks can be performed using NLTK, such as tokenization
and parse-tree visualization.
5.2.4 MATPLOTLIB:
Matplotlib is one of the most common packages used for data visualization in
Python. It is a cross-platform library for creating 2-dimensional plots from
data in arrays.
MATPLOTLIB.PYPLOT:
The matplotlib.pyplot is a collection of command style functions that make
matplotlib work like MATLAB. Each pyplot function makes some change to a figure:
e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting
area, decorates the plot with labels, etc. In this project, matplotlib.pyplot is
used to generate graphs such as bar graphs and Cartesian plots that represent the
different parameters versus the performance of the algorithm.

5.2.5 SNOWBALLSTEMMER:
The Snowball stemmer implements a stemming algorithm also known as the Porter2
stemming algorithm; it is an improved version of the Porter stemmer in which
several issues of the original were fixed.

5.3 FUNCTIONS USED FOR IMPLEMENTATION:


The user-defined functions used in the implementation of the project are:

5.3.1 DATA EXTRACT:


In this method, the data is extracted using Tweepy, which sends requests to the
Twitter API; Twitter returns tweets in response. We extracted 5,000 tweets from
Twitter using snscrape and also used a number of datasets from Kaggle.

5.3.2 CLEAN TWEET:


The tweets we collected may contain noisy data; to remove it we use the clean
tweet method. This method removes links, @mentions, RT markers, punctuation,
etc., leaving structured tweets with meaningful information.
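A sketch of such a cleaning function is shown below; the project does not list
its exact patterns, so these regular expressions are assumptions.

import re

def clean_tweet(tweet):
    """Remove links, @mentions, RT markers, and punctuation (illustrative patterns)."""
    tweet = re.sub(r'http\S+|www\.\S+', '', tweet)  # remove links
    tweet = re.sub(r'@\w+', '', tweet)              # remove @mentions
    tweet = re.sub(r'\bRT\b', '', tweet)            # remove retweet marker
    tweet = re.sub(r'[^\w\s]', '', tweet)           # remove punctuation
    return ' '.join(tweet.split())                  # collapse extra whitespace

print(clean_tweet("RT @user: Loving the new phone!! https://t.co/xyz"))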

5.3.3 SUBJECTIVITY:
The Subjectivity method is provided by TextBlob. Subjectivity is an output that
lies within [0, 1] and reflects the degree of personal opinion and judgment in
the text.

5.3.4 POLARITY:
In this method, TextBlob uses polarity to classify the tweets; the polarity
ranges between -1 and +1 and classifies the tweets as positive, negative, or
neutral.
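A small sketch of polarity-based labeling with TextBlob; the zero thresholds are
a common convention assumed here, not confirmed project settings.

from textblob import TextBlob

def get_sentiment(tweet):
    # Polarity lies in [-1, +1]; the sign decides the class in this sketch.
    polarity = TextBlob(tweet).sentiment.polarity
    if polarity > 0:
        return 'Positive'
    elif polarity < 0:
        return 'Negative'
    return 'Neutral'

print(get_sentiment("I really enjoyed this flight"))  # expected: Positive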

5.3.5 TOKENIZATION:
Word tokenization splits a large sample of text into words. This is a
requirement in natural language processing tasks where each word needs to be
captured and subjected to further analysis, such as classifying and counting
occurrences for a particular sentiment.

5.3.6 STEMMING:
Stemming is the process of reducing inflected words to their root forms, mapping
a group of words to the same stem even if the stem itself is not a valid word in
the language.
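A brief sketch combining NLTK tokenization with the Snowball stemmer; the
example sentence is arbitrary.

import nltk
nltk.download('punkt', quiet=True)  # tokenizer models used by word_tokenize
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')
tokens = word_tokenize("The flights were delayed and passengers were complaining")
print([stemmer.stem(t) for t in tokens])
# e.g. 'flights' -> 'flight', 'complaining' -> 'complain'; stems need not be valid words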

CHAPTER 6
RESULTS AND EVALUATION

6.1 PERFORMANCE METRICS
6.1.1 OVERVIEW:
Our algorithm is evaluated using four metrics: 1) confusion matrix, 2) F1-score,
3) precision, and 4) recall. These metrics are computed on the collected tweets
for the Bi-Directional Long Short-Term Memory neural network algorithm.

6.1.2 CONFUSION MATRIX:

The confusion matrix is the easiest way to measure the performance of a
classification problem where the output can take two or more classes. A
confusion matrix is a table with two dimensions, “Actual” and “Predicted”, and
each dimension distinguishes “True Positives (TP)”, “True Negatives (TN)”,
“False Positives (FP)”, and “False Negatives (FN)”, as shown below –

Figure 6.1 Confusion Matrix

Explanation of the terms associated with confusion matrix are as follows −

• True Positives (TP) − It is the case when both actual class & predicted class of data point is 1.

• True Negatives (TN) − It is the case when both actual class & predicted class of data

point is 0.

• False Positives (FP) − It is the case when actual class of data point is 0 & predicted

class of data point is 1.


• False Negatives (FN) − It is the case when actual class of data point is 1 & predicted

class of data point is 0.

6.1.3 F1-SCORE:
This score gives the harmonic mean of precision and recall. Mathematically, the
F1 score is the weighted average of precision and recall. The best value of F1
is 1 and the worst is 0. F1 can be calculated with the following formula –

F1 = 2 * (Precision * Recall) / (Precision + Recall)

6.1.4 PRECISION:
Precision, used in document retrieval, may be defined as the fraction of items
returned by our ML model that are actually correct. It can easily be calculated
from the confusion matrix with the following formula –

Precision = TP / (TP + FP)

where TP is True Positive and FP is False Positive.

6.1.5 RECALL:
Recall may be defined as the fraction of actual positives that are returned by
our ML model. It can easily be calculated from the confusion matrix with the
following formula –

Recall = TP / (TP + FN)

where TP is True Positive and FN is False Negative.
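For illustration, these metrics can be computed with scikit-learn on toy labels;
the arrays below are made up, not project results.

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Toy true and predicted labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))
print('Precision:', precision_score(y_true, y_pred))  # TP / (TP + FP)
print('Recall   :', recall_score(y_true, y_pred))     # TP / (TP + FN)
print('F1-score :', f1_score(y_true, y_pred))         # harmonic mean of the two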

6.2 RESULTS AND DISCUSSION

6.2.1 OVERVIEW
This chapter explains the results of our project; the screenshots for each step
are included and explained.

6.2.2 DATASETS:

The datasets used in our projects are:

6.2.2.1 STATIONARY DATASET:


A stationary dataset is one whose statistical properties, such as the mean,
variance, and autocorrelation, are all constant over time.

DATASET                                           CLASSES   FEATURES   EXAMPLES

twitter-and-reddit-sentimental-analysis-dataset      2          4        8,000
appletwittersentimenttexts                           2          8        5,787
twitterdata                                          2         13       10,587
twitter-sentiment-dataset                            2          7        6,347

Table 6.1 Stationary Datasets

6.2.3 SCREENSHOTS

Figure 6.2 Data Preprocessing Of twitter-airline-sentiment

Figure 6.3 Data Preprocessing Of twitter-and-reddit-sentimental-analysis-dataset

Figure 6.4 Data Preprocessing Of twitter-data

Figure 6.5 Data Preprocessing Of appletwittersentimenttexts

Figure 6.6 Data Visualization: graphical representation of the data to
facilitate understanding, analysis, and communication of insights

Figure 6.7 Data Visualization of Mean, Median, Mode Values for Twitter Data (Tweets)

Figure 6.8 Distribution of text length for positive sentiment tweets

Figure 6.9 Distribution of text length for negative sentiment tweets


Figure 6.10 Pie Chart Of Different Sentiments of tweets

Figure 6.11 Word Cloud of Text to vector Representation

Figure 6.12 Word Count

Figure 6.13 Model Summary

Figure 6.14 TensorFlow Model Diagram

Figure 6.15 Confusion Matrix of True vs Predicted Labels for Bi-Directional LSTM (NN)

Figure 6.16 Model Accuracy And Model Loss Graph

Figure 6.17 Predicted Outputs for Sample Examples

CHAPTER 7
CONCLUSION AND FUTURE ENHANCEMENT
7.1 CONCLUSION
Social media is witnessing a massive increase in the number of users per day.
People prefer to share their honest opinions on social media instead of sharing with
someone in person. Using the posts from Twitter, we examined the common public’s
aggregate reaction toward various topics. Motivated by the mixed reactions that
follow certain events on Twitter, we collected tweets during specific time
periods. After annotation and preprocessing, we applied the collected data to a
bi-directional LSTM. We observed the best performance with the bi-directional
LSTM and unigrams: this combination gives an accuracy of 95.4%, the best among
all the combinations we executed on our data set. We consolidated the
performance by calculating precision, recall, F1-score, and tenfold
cross-validation for all the combinations, and obtained the best results with
the bi-directional LSTM and unigrams. Using this combination, we performed
sentiment analysis of tweets posted by the public during specific events and
found that almost half of the population (42.8%) talks positively about the
topic, 33.1% is neutral, and 24.1% feels negative for some reason.

7.2 FUTURE ENHANCEMENT


"In Future Work, we have planned to extract trending tweets on Twitter and
analyze the sentiments using NLP. We aim to improve accuracy further by
implementing supervised machine learning algorithms. Additionally, as a future
enhancement, we will incorporate image sentiment analysis to provide a more
comprehensive understanding of user sentiments across different media types."

APPENDIX
CODE:
!pip install opendatasets
import numpy as np # linear algebra
import pandas as pd # data processing
import os
import tweepy as tw #for accessing Twitter API
import opendatasets as od

#For Preprocessing
import re # RegEx for removing non-letter characters
import nltk #natural language processing
nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.stem.porter import *

# For Building the model


from sklearn.model_selection import train_test_split
import tensorflow as tf
import seaborn as sns

#For data visualization


import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
%matplotlib inline

pd.options.plotting.backend = "plotly"

od.download("https://www.kaggle.com/datasets/cosmos98/twitter-and-reddit-sentimental-analysis-dataset")
od.download("https://www.kaggle.com/datasets/seriousran/appletwittersentimenttexts")
od.download("https://www.kaggle.com/datasets/surajkum1198/twitterdata")
od.download("https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment")
od.download("https://www.kaggle.com/datasets/saurabhshahane/twitter-sentiment-dataset")

# Load Tweet dataset


df1 = pd.read_csv('/content/input/twitter-and-reddit-sentimental-analysis-dataset/Twitter_Data.csv')
# Output first five rows
df1.head()

# Load Tweet dataset


df2 = pd.read_csv('/content/input/appletwittersentimenttexts/apple-twitter-sentiment-texts.csv')
df2 = df2.rename(columns={'text': 'clean_text', 'sentiment':'category'})
df2['category'] = df2['category'].map({-1: -1.0, 0: 0.0, 1:1.0})
# Output first five rows

df2.head()

# Load Tweet dataset


df3 = pd.read_csv('/content/input/twitterdata/finalSentimentdata2.csv')
df3 = df3.rename(columns={'text': 'clean_text', 'sentiment':'category'})
df3['category'] = df3['category'].map({'sad': -1.0, 'anger': -1.0, 'fear': -1.0, 'joy':1.0})
df3 = df3.drop(['Unnamed: 0'], axis=1)

# Output first five rows
df3.head()

# Load Tweet dataset


df4 = pd.read_csv('/content/input/twitter-airline-sentiment/Tweets.csv')
df4 = df4.rename(columns={'text': 'clean_text', 'airline_sentiment':'category'})
df4['category'] = df4['category'].map({'negative': -1.0, 'neutral': 0.0, 'positive':1.0})
df4 = df4[['category','clean_text']]
# Output first five rows
df4.head()

# df5 was never defined in the original listing; the file path and file name
# below are assumptions for the fifth (twitter-sentiment-dataset) download.
df5 = pd.read_csv('/content/input/twitter-sentiment-dataset/Twitter_Data.csv')

df = pd.concat([df1, df2, df3, df4, df5], ignore_index=True)

# Check for missing data


df.isnull().sum()

# drop missing rows


df.dropna(axis=0, inplace=True)

# dimensionality of the data


df.shape

# Map tweet categories


df['category'] = df['category'].map({-1.0:'Negative', 0.0:'Neutral', 1.0:'Positive'})

# Output first five rows


df.head()

# The distribution of sentiments

df.groupby('category').count().plot(kind='bar')

# Calculate tweet lengths


tweet_len = pd.Series([len(tweet.split()) for tweet in df['clean_text']])

# The distribution of tweet text lengths


tweet_len.plot(kind='box')

fig = plt.figure(figsize=(14,7))
df['length'] = df.clean_text.str.split().apply(len)
ax1 = fig.add_subplot(122)
sns.histplot(df[df['category']=='Positive']['length'], ax=ax1,color='green')
describe = df.length[df.category=='Positive'].describe().to_frame().round(2)

ax2 = fig.add_subplot(121)
ax2.axis('off')
font_size = 14
bbox = [0, 0, 1, 1]
table = ax2.table(cellText = describe.values, rowLabels = describe.index, bbox=bbox,
colLabels=describe.columns)
table.set_fontsize(font_size)
fig.suptitle('Distribution of text length for positive sentiment tweets.', fontsize=16)

plt.show()

fig = plt.figure(figsize=(14,7))
df['length'] = df.clean_text.str.split().apply(len)
ax1 = fig.add_subplot(122)
sns.histplot(df[df['category']=='Negative']['length'], ax=ax1,color='red')

describe = df.length[df.category=='Negative'].describe().to_frame().round(2)

ax2 = fig.add_subplot(121)
ax2.axis('off')
font_size = 14
bbox = [0, 0, 1, 1]
table = ax2.table(cellText = describe.values, rowLabels = describe.index, bbox=bbox,
colLabels=describe.columns)
table.set_fontsize(font_size)
fig.suptitle('Distribution of text length for Negative sentiment tweets.', fontsize=16)

plt.show()

import plotly.express as px
fig = px.pie(df, names='category', title ='Pie chart of different sentiments of tweets')
fig.show()

df.drop(['length'], axis=1, inplace=True)


df.head()

#### Visualizing data into wordclouds

from wordcloud import WordCloud, STOPWORDS

def wordcount_gen(df, category):


'''
Generating Word Cloud
inputs:

- df: tweets dataset
- category: Positive/Negative/Neutral
'''
# Combine all tweets
combined_tweets = " ".join([tweet for tweet in
df[df.category==category]['clean_text']])

# Initialize wordcloud object


wc = WordCloud(background_color='white',
max_words=60,
stopwords = STOPWORDS)

# Generate and plot wordcloud


plt.figure(figsize=(10,10))
plt.imshow(wc.generate(combined_tweets))
plt.title('{} Sentiment Words'.format(category), fontsize=20)
plt.axis('off')
plt.show()
print("\n\n")

# Positive tweet words


wordcount_gen(df, 'Positive')

# Negative tweet words


wordcount_gen(df, 'Negative')

# Neutral tweet words


wordcount_gen(df, 'Neutral')

def tweet_to_words(tweet):
''' Convert tweet text into a sequence of words '''

# convert to lowercase
text = tweet.lower()
# remove non letters
text = re.sub(r"[^a-zA-Z0-9]", " ", text)
# tokenize
words = text.split()
# remove stopwords
words = [w for w in words if w not in stopwords.words("english")]
# apply stemming
words = [PorterStemmer().stem(w) for w in words]
# return list
return words

print("\nOriginal tweet ->", df['clean_text'][0])


print("\nProcessed tweet ->", tweet_to_words(df['clean_text'][0]))

# Apply data processing to each tweet


X = list(map(tweet_to_words, df['clean_text']))

from sklearn.preprocessing import LabelEncoder

# Encode target labels


le = LabelEncoder()
Y = le.fit_transform(df['category'])

y = pd.get_dummies(df['category'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25,
random_state=1)

from sklearn.feature_extraction.text import CountVectorizer


#from sklearn.feature_extraction.text import TfidfVectorizer

vocabulary_size = 5000

# Tweets have already been preprocessed hence dummy function will be passed in
# to preprocessor & tokenizer step
count_vector = CountVectorizer(max_features=vocabulary_size,
# ngram_range=(1,2), # unigram and bigram
preprocessor=lambda x: x,
tokenizer=lambda x: x)
#tfidf_vector = TfidfVectorizer(lowercase=True, stop_words='english')

# Fit the training data


X_train = count_vector.fit_transform(X_train).toarray()

# Transform testing data


X_test = count_vector.transform(X_test).toarray()

# Plot the BoW feature vector


plt.plot(X_train[2,:])
plt.xlabel('Word')
plt.ylabel('Count')
plt.show()

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

max_words = 5000
max_len=50

def tokenize_pad_sequences(text):
'''
    This function tokenizes the input text into sequences of integers and then
    pads each sequence to the same length
'''
# Text tokenization
tokenizer = Tokenizer(num_words=max_words, lower=True, split=' ')
tokenizer.fit_on_texts(text)
# Transforms text to a sequence of integers
X = tokenizer.texts_to_sequences(text)
# Pad sequences to the same length
X = pad_sequences(X, padding='post', maxlen=max_len)
# return sequences
return X, tokenizer

print('Before Tokenization & Padding \n', df['clean_text'][0])


X, tokenizer = tokenize_pad_sequences(df['clean_text'])
print('After Tokenization & Padding \n', X[0])

import pickle

# saving
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

# loading
with open('tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)

y = pd.get_dummies(df['category'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25,
random_state=1)
print('Train Set ->', X_train.shape, y_train.shape)
print('Validation Set ->', X_val.shape, y_val.shape)
print('Test Set ->', X_test.shape, y_test.shape)

import keras.backend as K

def f1_score(precision, recall):


''' Function to calculate f1 score '''

f1_val = 2*(precision*recall)/(precision+recall+K.epsilon())
return f1_val

from keras.models import Sequential


from keras.layers import Embedding, Conv1D, MaxPooling1D, Bidirectional, LSTM, Dense, Dropout
from keras.metrics import Precision, Recall
from keras.optimizers import SGD
from keras.optimizers import RMSprop
from keras import datasets

from keras.callbacks import LearningRateScheduler
from keras.callbacks import History

from keras import losses

vocab_size = 5000
embedding_size = 32
epochs=20
learning_rate = 0.1
decay_rate = learning_rate / epochs
momentum = 0.8

sgd = SGD(learning_rate=learning_rate, momentum=momentum, nesterov=False)


# Build model
model= Sequential()
model.add(Embedding(vocab_size, embedding_size, input_length=max_len))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Bidirectional(LSTM(32)))
model.add(Dropout(0.4))
model.add(Dense(3, activation='softmax'))

import tensorflow as tf
tf.keras.utils.plot_model(model, show_shapes=True)

print(model.summary())

# Compile model
model.compile(loss='categorical_crossentropy', optimizer=sgd,
metrics=['accuracy', Precision(), Recall()])

# Train model

batch_size = 64
history = model.fit(X_train, y_train,
validation_data=(X_val, y_val),
batch_size=batch_size, epochs=epochs, verbose=1)

# Evaluate model on the test set


loss, accuracy, precision, recall = model.evaluate(X_test, y_test, verbose=0)
# Print metrics
print('')
print('Accuracy : {:.4f}'.format(accuracy))
print('Precision : {:.4f}'.format(precision))
print('Recall : {:.4f}'.format(recall))
print('F1 Score : {:.4f}'.format(f1_score(precision, recall)))

def plot_training_hist(history):
'''Function to plot history for accuracy and loss'''

fig, ax = plt.subplots(1, 2, figsize=(10,4))


# first plot
ax[0].plot(history.history['accuracy'])
ax[0].plot(history.history['val_accuracy'])
ax[0].set_title('Model Accuracy')
ax[0].set_xlabel('epoch')

ax[0].set_ylabel('accuracy')
ax[0].legend(['train', 'validation'], loc='best')
# second plot
ax[1].plot(history.history['loss'])
ax[1].plot(history.history['val_loss'])
ax[1].set_title('Model Loss')
ax[1].set_xlabel('epoch')
ax[1].set_ylabel('loss')
ax[1].legend(['train', 'validation'], loc='best')

plot_training_hist(history)

from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(model, X_test, y_test):


'''Function to plot confusion matrix for the passed model and the data'''

sentiment_classes = ['Negative', 'Neutral', 'Positive']


# use model to do the prediction
y_pred = model.predict(X_test)
# compute confusion matrix
cm = confusion_matrix(np.argmax(np.array(y_test),axis=1), np.argmax(y_pred,
axis=1))
# plot confusion matrix
plt.figure(figsize=(8,6))
sns.heatmap(cm, cmap=plt.cm.Blues, annot=True, fmt='d',
xticklabels=sentiment_classes,
yticklabels=sentiment_classes)
plt.title('Confusion matrix', fontsize=16)

    # Rows of the confusion matrix are true labels, columns are predictions.
    plt.xlabel('Predicted label', fontsize=12)
    plt.ylabel('True label', fontsize=12)

plot_confusion_matrix(model, X_test, y_test)

# Save the model architecture & the weights


model.save('best_model.h5')
print('Best model saved')

from keras.models import load_model

# Load model
model = load_model('best_model.h5')

def predict_class(text):
    '''Function to predict the sentiment class of the passed text.
    Note: text should be a list of strings, e.g. predict_class(['I love this!'])'''

sentiment_classes = ['Negative', 'Neutral', 'Positive']


max_len=50

# Transforms text to a sequence of integers using a tokenizer object


xt = tokenizer.texts_to_sequences(text)
# Pad sequences to the same length
xt = pad_sequences(xt, padding='post', maxlen=max_len)
# Do the prediction using the loaded model
yt = model.predict(xt).argmax(axis=1)
# Print the predicted sentiment
print('The predicted sentiment is', sentiment_classes[yt[0]])

REFERENCES
[1] R. Liu, Y. Shi, C. Jia, and M. Jia, “A survey of sentiment analysis based on transfer
learning,” IEEE Access, vol. 7, pp. 85401–85412, 2019.
[2] B. R. Naiknaware and S. S. Kawathekar, “Prediction of 2019 Indian election using
sentiment analysis,” in Proc. 2nd Int. Conf., Aug. 2018, pp. 660–665.
[3] D. D. Wu, L. Zheng, and D. L. Olson, “A decision support approach for online stock
forum sentiment analysis,” IEEE Trans. Syst., Man, Cybern. Syst., vol. 44, no. 8, pp.
1077–1087, Aug. 2014.
[4] J. Ding, H. Sun, X. Wang, and X. Liu, “Entity-level sentiment analysis of issue
comments,” in Proc. 3rd Int. Workshop Emotion Awareness Softw. Eng., Jun. 2018, pp.
7–13.
[5] M. Pota, M. Esposito, M. A. Palomino, and G. L. Masala, “A subwordbased deep
learning approach for sentiment analysis of political tweets,” in Proc. 32nd Int. Conf.
Adv. Inf. Netw. Appl. Workshops (WAINA), May 2018, pp. 651–656.
[6] X. Wang, F. Wei, X. Liu, M. Zhou and M. Zhang, "Topic sentiment analysis in
twitter: a graph-based hashtag sentiment classification approach", Proceedings of the
20th International Conference on Information Knowledge Management, pp. 1031-
1040, 2011.
[7] A. Quazi and M. K. Srivastava, "Twitter sentiment analysis using machine learning"
in VLSI Microwave and Wireless Technologies, Singapore:Springer Nature Singapore,
pp. 379-389, 2023.
[8] J. Ferdoshi and J. Ferdoshi, "Dataset for twitter sentiment analysis using roberta and
vader", Mendeley Data, 2023.
[9] S. Jawale, "Twitter sentiment analysis", INTERANTIONAL JOURNAL OF
SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT, vol. 07, 04 2023.
[10] J. Lee et al., "Health information technology trends in social media: Using twitter
data", Healthc. Inform. Res., vol. 25, no. 2, pp. 99-105, 2019.
