Project Report
on
SENTIMENT ANALYSIS OF TWITTER DATA SET
Bachelor of Technology
in
Computer Science and Engineering
By
BHARAT SINGH 1709710035
CHOUDHARY RISHAB KUMAR 1709710038
PRASHANT RAJ 1709710080
GALGOTIAS COLLEGE OF ENGINEERING & TECHNOLOGY
GREATER NOIDA, UTTAR PRADESH, INDIA - 201306
CERTIFICATE
ACKNOWLEDGEMENT
We have put great effort into this project. However, it would not have been possible
without the kind support and help of many individuals and organizations. We would
like to extend our sincere thanks to all of them.
We are highly indebted to DR. RITESH SRIVASTAVA for his guidance and
constant supervision, for providing the necessary information regarding the
project, and for his support in completing it.
We also express gratitude towards our parents for their kind co-operation and
encouragement, which helped us in the completion of this project. Our thanks and
appreciation also go to our friends who helped in developing the project and to the
people who willingly helped us with their abilities.
BHARAT SINGH
PRASHANT RAJ
ABSTRACT
This project aims to present an overview of sentiment analysis. Sentiment
analysis works in the background of many industries. Today, industries have
become more considerate of their customers' demands, likes, dislikes, opinions and
responses. The bright minds of the consumer market have realized the power of
opinions. Producers have learned to respect the needs and ideas of their
customers, as they know that in this age of neck-to-neck competition, their customers
have the power to make or break them. Keeping an eye on customer responses
has become a crucial part of the production system. Sentiment analysis has emerged as
a powerful tool to make this process easier and much faster. Sentiment analysis
involves collection of data, pre-processing and polarity detection.
TABLE OF CONTENTS
Title Page
CERTIFICATE ii
ACKNOWLEDGEMENT iii
ABSTRACT iv
CONTENTS v
LIST OF TABLES vi
LIST OF FIGURES vii
LIST OF ABBREVIATIONS viii
CHAPTER 1: INTRODUCTION 1
CHAPTER 2: LITERATURE REVIEW 5
CHAPTER 3: PROBLEM FORMULATION 10
3.1 IE Difficulties 10
3.2 IE Techniques 11
3.3 Representation Models 12
CHAPTER 4: PROPOSED WORK 13
4.1 Data Set Description 13
4.2 Model Components 13
CHAPTER 5: SYSTEM DESIGN 17
CHAPTER 6: IMPLEMENTATION 18
CHAPTER 7: RESULT ANALYSIS 29
CHAPTER 8: CONCLUSION, LIMITATIONS AND FUTURE SCOPE 31
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER-1
INTRODUCTION
Given a large set of data, e.g., a database of movie reviews in our case, the goal of
Sentiment Analysis is to determine the emotional tone of each review and classify it
as positive or negative. Since the dataset is huge, comprising over ten thousand
review files, we want to automate the process of sentiment analysis using machine
learning and natural language processing techniques. With the explosion of data in
recent years, a lot of information is available to us that can be used to improve
business strategies, help research and also solve social problems. Organizations can
use sentiment analysis to learn their customers' reactions to their products and
services.
The ability to extract insights from social data such as Twitter, Facebook and
Instagram is a practice widely adopted by organisations all over the world. From our
research on past work in sentiment analysis, a number of methods have already been
used successfully for this problem; these are discussed in more detail in the next
chapter. We have chosen the Random Forest classifier for classification of the text.
One of its main advantages is that it handles high-dimensional spaces as well as
large numbers of training examples very well, and it mitigates the overfitting that is
prevalent in such cases.
Polarity: the attribute that describes whether the text expresses a positive or a
negative opinion.
Subject: the attribute that describes the topic being talked about.
Opinion holder: the attribute that describes the person, object or entity
expressing the opinion in the extracted text.
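These three attributes can be pictured as one record per analysed opinion. A minimal sketch using a Python dataclass follows; the class and field names (`Opinion`, `polarity`, `subject`, `holder`) and the sample values are illustrative, not part of any specific library:

```python
from dataclasses import dataclass

# One analysed opinion with the three attributes described above.
@dataclass
class Opinion:
    text: str      # the extracted expression
    polarity: str  # "positive" or "negative"
    subject: str   # the topic being talked about
    holder: str    # the person or entity expressing the opinion

review = Opinion(
    text="The battery life is fantastic",
    polarity="positive",
    subject="battery life",
    holder="@some_user",
)
print(review.polarity)  # positive
```

A real system would fill these fields automatically from the classifier and extraction steps described later.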
In the current scenario, sentiment analysis is one of the topics attracting high
interest and development, because it has many practical real-life applications. As
information constantly grows, whether publicly or privately available, a large number
of opinionated text expressions are readily available in review sites, blogs, forums,
and social media.
With the help of sentiment analysis techniques and systems, unstructured
information/text can automatically be converted into structured data on public
opinion about products, politics, services, brands, or any other topic that people
express opinions about. This structured and classified data can be useful for
different commercial applications such as public relations, marketing analysis,
product reviews, product feedback, net promoter scoring, and customer service.
Any industry's success depends primarily on its customers. Social media has come
as a boon for customers, and customer responses have become more and more
important. Social media provides a platform where every opinion matters. People can
endorse their favourites and voice criticism more openly. With people's voices louder
than ever, it is very important for a brand to maintain a good reputation in the public
eye. This is the point at which sentiment analysis comes into action.
Sentiment analysis, also known as opinion mining or emotion AI, has always been an
intriguing topic in the world of computer science. Much like any other fashionable
technology in computer science, sentiment analysis still remains a widely bandied
yet misunderstood term. Breaking the term into two words, we get 'sentiment',
meaning emotions, and 'analysis', meaning detailed examination.
In sentiment analysis, natural language processing, text analysis, computational
linguistics, and biometrics are used to systematically identify, extract, quantify, and
study affective states and subjective information.
The best businesses are those that completely understand the sentiments of their
customers. Sentiment Analysis is the field of understanding the emotions in
available text/information, and the sentiment analysis tool is a must-understand tool
for modern workplace business leaders and developers.
Along with many other fields, advances such as Deep Learning have made
Sentiment Analysis one of the cutting-edge applications of modern algorithms. In
the current scenario, we use natural language processing techniques, statistical
theory, and text analysis for the extraction and identification of the sentiment of
text in three main categories: neutral, positive, or negative.
SENTIMENT ANALYSIS FOR CUSTOMER SERVICE
Customer service agents often use sentiment analysis to sort incoming user email
into "urgent" or "not very urgent" buckets on the basis of the email's sentiment,
proactively identifying frustrated users. The agent then directs their time toward
resolving the most urgent cases first. As customer service becomes more and more
automated through Machine Learning, understanding the sentiment of a given case
becomes increasingly important.
A lot of such programs are already up and running. Bing recently incorporated
sentiment evaluation into its Multi-Perspective Answers product. Hedge funds are
almost certainly using the technology to predict price fluctuations based on public
sentiment. And businesses like CallMiner provide sentiment evaluation for client
interactions as a service.
CHAPTER 2
LITERATURE REVIEW
ABSTRACT
In the current scenario, some people spread hate through social media platforms
like Twitter, Facebook, etc. Most of the people doing this are highly followed
ministers, celebrities or other famous persons. The problem is even bigger when we
look at the number of followers these people have and the increasing scale of the
internet. These trends can develop negative impacts in our society and wrong
tendencies in our youth, and may sometimes even push someone toward illegal
steps. Due to the reserved nature and busy schedules of people, it is becoming
extremely difficult to interact with peers and family members. Therefore, social
media platforms are considered the most used platforms for getting news and
general awareness as well. People feel free to share their political views and
feelings over social media such as Twitter and Facebook with friends and family
members via services such as messaging. Therefore, analysing the tweets shared by
a famous person can help a lot in making a change in society. The aim of this task
is to detect the percentage of hate speech (as well as the percentages of neutral and
positive speech) in tweets. In simpler terms, in this project we consider a tweet
negative if it contains hate speech, i.e. a sexist or racist sentiment is associated with
it that would create a bad impact on society and especially on readers. Hence, the
task is to classify sexist, racist, happy or sad tweets against other tweets. This paper
is divided into six sections: first the introduction, second the literature survey, third
the proposed methodology, fourth the result analysis, fifth the future scope and
applications, and finally sixth the conclusion.
INTRODUCTION:
Most data-mining research based on sentiment analysis assumes that the
information to be "mined" is already present in relational database form. But the
real scenario is a little different. The data that we get from many different
applications, and mainly the social media data that we extract using different tools
(for Twitter, we use the Twitter API for extraction), is obviously not structured.
Mostly, the data takes the form of unstructured text instead of structured databases.
As a result, text mining involves different phases. For example, the phase of
discovering useful knowledge from unstructured information present in the
extracted text is becoming an increasingly important aspect of Knowledge
Discovery. Most work in text mining and analysis does not employ any form of
natural-language processing (NLP), treating extracted information as an unordered
"bag of words/sentences", which is typical in information retrieval.
The standard vector space model of text represents a document as a sparse vector
that specifies a weighted frequency for every distinct word or token that appears in
the corpus. Such a simplified representation of text has been shown to be quite
effective for a number of standard tasks such as document retrieval, classification,
and clustering. However, most of the knowledge that might be mined from text
cannot be discovered using a simple bag-of-words representation.
The entities referenced in a document, and the properties and relationships asserted
about and between those entities, cannot be determined using a standard vector-space
representation. Although full natural-language understanding is still far from the
capabilities of current technology, existing methods in information extraction (IE)
are able, with reasonable accuracy, to recognize several types of entities in text and
identify some of the relationships asserted between them. Therefore, information
extraction can serve as an important enabling technology for text mining and
analysis. If the knowledge to be discovered is expressed directly in the information
to be mined, Information Extraction alone could serve as an effective approach to
text/speech analysis. Nevertheless, if the extracted information contains data in
unstructured form rather than abstract knowledge, it may be helpful to first use an
Information Extraction process to transform the unstructured corpus into a
structured database designed according to our requirements, and then use traditional
data-mining tools to identify abstract patterns in the extracted data.
There are two approaches to text mining for sentiment analysis with information
extraction, and we use one of our own research projects to illustrate each approach.
First, we introduce the basics of information extraction. Then, we discuss using IE
to directly extract knowledge from text. Finally, we discuss finding knowledge by
mining data that was extracted in the first step from unstructured or semi-structured
text.
BRIEF LITERATURE SURVEY:
Several research studies have been done in the area of text mining, and nowadays
data mining is becoming one of the emerging technologies. Based on a study of
some papers, we have compiled our own literature survey as given below.
Table 2.1: Brief Literature Survey

Research Paper 2: "Enhancing Predictive Power of Cluster-Boosted Regression with Text Based Indexing" [2]
Authors: Mark Chignell, Nipon Charoenkitkarn, Jonathan H. Chan, Wutthipong Kongburan.
Advantages: the model implements the KNN algorithm; it analyzes Electronic Health Records to improve efficiency; it also examines whether textual features can be used to improve the accuracy of ICU mortality prediction.
Disadvantages: very complex to implement, because even a single mistake could harm someone's health.

Research Paper 3: "Financial Latent Dirichlet Allocation (FinLDA): Feature Extraction in Text and Data Mining for Financial Time Series Prediction" [5]
Authors: Nout Kaunungsukkasem, Teerapong Leelanupab.
Advantages: in this model, technical and fundamental analyses are used by investors to predict financial time evolution, such as stock prices.
Disadvantages: the model is not fast and takes much time to predict.

Research Paper 4: "Multistage Gene Normalization and SVM-Based Ranking for Protein Interactor Extraction in Full-Text Articles" [6]
Authors: Hong-Jie Dai, Po-Ting Lai, and Richard Tzong-Han Tsai.
Advantages: using the multistage GN algorithm, system performance improved by 1.719 percent compared to a one-stage GN algorithm; experimental results also show that with full text, versus abstract only, INT AUC performance was 22.6 percent higher.
Disadvantages: the model can only be used for multistage gene normalization for protein interactor extraction.
CONCLUSION:
Based on all four research papers, we gathered a lot of information about various
models. Many models implement algorithms such as the Support Vector Machine
algorithm, the k-Nearest Neighbours algorithm, the term frequency-inverse document
frequency model, data mining concepts, etc. Data mining gives insight behind
various decisions, and our model is one such application. One of the basic
understandings we gained is that if we want our model to work effectively, we need
a good volume of training data. We implemented NLP techniques, which are a
subset of data mining techniques. For this we used Python as the programming
language and employed several open-source libraries such as Tkinter and TextBlob.
TextBlob ships with pre-trained data for different sentiments.
CHAPTER 3
PROBLEM FORMULATION
The ambiguity and complexity present in human language make it hugely difficult
for a computer to successfully understand human language. Apart from large and
complex grammars, there are always issues when a computer predicts the meaning
of language used freely by people who do not adhere to rules. Abbreviations, wrong
spellings, idioms, and slang are just a few of the problems, and there is also vast
variation in people's tone. In social media, however, the emoji used in the current
scenario express the sentiment of the message up to some extent. Below, we
summarize the issues faced during the process and different methods to deal with
them.
3.1 IE DIFFICULTIES:
Usually, text documents contain many words that are not necessary for
understanding the general idea of the text. These frequently used words, which do
not have a great impact on the sentiment, such as 'of', 'a', and 'the', are called stop
words and can be directly ignored in many situations. This is the approach fields
like IR take to reduce the dimensionality of their term spaces and improve
performance. It is less useful in the field of text analysis, however, because some of
these words can help clarify semantics and lend information. For instance, consider
the statement "He was promoted", which contains two potential stop words, "he"
and "was". Without "was" we interpret an entirely different meaning, "He
promoted", and the same would be the case if "he" were not present in the sentence.
In such a scenario, stemming (or lemmatization) comes into action; it is also
generally used in information reduction (technically, reduction of the dimensionality
of the text). To group similar words together, we reduce words to their stem, or root
form. For example, "walking", "walk", "walked", and "walker" will all be reduced
to the root word "walk". Although it can be argued that, like stop-word lists, this
can be harmful to the semantics of the text, it is still an easy way to reduce the
dimensionality of the information. To eliminate noisy data, we can use spelling
correctors and acronym and abbreviation expanders, which generally require a
thesaurus or dictionary. We must also try to deal with a larger issue in NLP as a
whole, namely ambiguity. For example, lexically ambiguous words are difficult for
an algorithm to identify correctly.
There are two types of ambiguity for such words which are lexically ambiguous:
a) Homonymy
b) Polysemy
For these two types of ambiguity, the main difference lies in how the information is
represented in a person's mental lexicon. Homonymy (different words that happen
to have the same sound) is not a problem when performing sentiment analysis of
text, since the correct word is present in the information. But polysemy means that
one word carries many meanings and senses. Take, for instance, the word "lean",
which has different meanings in different parts of speech (or senses). As an
adjective, "lean" means "lacking or deficient in flesh", "containing little or
absolutely no fat", "lacking in productiveness, sufficiency, or richness", "containing
little valuable mineral", or "of fuel mixtures: low in combustible component". As a
verb, it means "to bend, incline, or deviate from a vertical position", "to cast one's
weight on another for support", "to incline in opinion, taste, or desire", or "to rely
on someone for inspiration or support".
The word sense disambiguation (WSD) problem deals with finding the most probable
sense of a polysemous word (a word with multiple meanings). We can approach this
problem by considering the context in which the word occurs, to, for instance,
determine if a word is a noun or verb.
Tagging involves the labelling of words in a corpus with part of speech (PoS) tags or
XML mark-up. PoS tags label syntactic categories like nouns, verbs, and adjectives in
order to identify syntactic structures like noun phrases or verb phrases.
A collocation is a sequence of words that are commonly used together but mean
different things if separated; an example is "light rain". In general, we treat the
collocation as a single unit rather than splitting it into separate words, because we
might lose the correct meaning of the text if we consider the words only separately.
For instance, consider the sentence "the person was playing 'Holi' and enjoyed a
lot." If we look only at the individual meanings of the words in this sentence, it
would appear that 'Holi' is a game, which it is not. These kinds of considerations
are also kept in mind while performing information reduction.
Finally, we perform tokenization, a question we would face at some point during
the above process: do we want to split the text into units of sentences, phrases,
paragraphs, sequences of a particular length, or single words? To support the
splitting, we can take advantage of delimiters present in the text, such as:
a) Spaces
b) Tabs
c) Punctuation marks
d) Certain stop words.
We can even use a method like N-gram to find the most frequent phrases and words
in the extracted and processed text.
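Delimiter-based splitting can be sketched with Python's standard `re` module. This is a simplified scheme that treats whitespace and punctuation marks as delimiters; a real tokenizer would handle contractions, hashtags and emoticons more carefully:

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lower-case word tokens, using spaces, tabs and
    punctuation marks as delimiters (a deliberately simple scheme)."""
    return [t for t in re.split(r"[\s\.,;:!\?\"'()]+", text.lower()) if t]

tokens = tokenize("He was promoted, and everyone cheered!")
print(tokens)  # ['he', 'was', 'promoted', 'and', 'everyone', 'cheered']

# The N-gram idea with n = 1: count the most frequent tokens.
print(Counter(tokens).most_common(2))
```

Certain stop words could also be added to the split pattern, at the cost of the semantic risks discussed above.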
3.2 IE TECHNIQUES:
Most IE systems learn their parameters from human-annotated (supervised) training
data, applying them to tasks such as learning patterns or inductive-logic rules that
match the beginning or ending of the phrases to be extracted. We can also take
advantage of standard feature-based classifiers, which predict the label of every
token based on the token and its surrounding context. By representing the context
using a set of features that includes the one or two tokens on either side of the
target token and the previously extracted labels, we can generalize the sequence-
labelling problem to decision trees, boosting, support-vector machines (SVMs),
memory-based learning (MBL), transformation-based learning (TBL), maximum
entropy (MaxEnt), and many others.
Using the Twitter API, we can collect a Twitter dataset for sentiment analysis.
There are many types of APIs and tools that can be used to crawl and collect data:
a) Twitter’s Firehose
b) Twitter’s Search API
c) Twitter’s Streaming API
d) NodeXL
3.3 REPRESENTATION MODELS:
The most commonly used representation of extracted text for sentiment analysis is
the Vector Space Model. The extracted and processed text is described by a vector
whose dimensions correspond to distinct text features and whose entries are a
function of the frequencies with which those features appear.
Things like the order of and relations between words are totally ignored, which is
why this model is also called the "bag-of-words" (BOW) model. Most other
representations are extensions of the BOW model. Some focus on phrases instead
of single words, some give importance to the semantics of and relations between
words, and others take advantage of the hierarchical nature of the text.
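The vector-space/BOW representation can be sketched in a few lines of standard-library Python. The vocabulary and document here are invented for illustration; real pipelines build the vocabulary from the whole corpus:

```python
from collections import Counter

def bow_vector(document, vocabulary):
    """Represent a document as term-frequency entries over a fixed
    vocabulary; word order and relations are deliberately ignored."""
    counts = Counter(w.strip(".,!?") for w in document.lower().split())
    return [counts[word] for word in vocabulary]

vocab = ["movie", "great", "bad", "acting"]
print(bow_vector("Great movie, great acting", vocab))  # [1, 2, 0, 1]
```

Note how the vector loses all ordering information: "great movie" and "movie great" produce the same vector, which is exactly the limitation the text above describes.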
CHAPTER 4
PROPOSED WORK
The proposed methodology helps to save the lives of people who may get disturbed
by negative tweets, some of whom might take serious steps after reading them. As a
further application of the analysis, it is possible to analyse the neutral, positive, or
negative tweets of a certain Twitter account or from a Twitter data set.
PROPOSED METHODOLOGY:
The aim is to extract information from the tweets of a user account on Twitter
(generally a famous person with a sufficient number of followers) and use it for
sentiment analysis, for different purposes, to find the percentage of neutral,
positive, and negative tweets the person is posting on social media. The aim of this
task is to detect the percentage of hate speech (as well as the percentages of neutral
and positive speech) in tweets. In simpler terms, in this project we consider a tweet
negative if it contains hate speech, i.e. a sexist or racist sentiment is associated with
it that would create a bad impact on society and especially on readers. Hence, the
task is to classify sexist, racist, happy or sad tweets against other tweets. The model
also includes the analysis of emoticons in order to completely parse the statements:
the aim is to extract information from the text messages of the user and use it for
different purposes such as sentiment analysis.
In this component, the data is assigned a sentiment, such as positive or negative,
and the extent of it, by performing data pre-processing and applying the Support
Vector Machine algorithm.
• TEXT PRE-PROCESSING:
The processes involved in text pre-processing are:
Tokenization: Each tweet of the account is divided into meaningful words, known
as tokens. Example: "Morning walk is a bliss" is converted to "Morning", "walk",
"is", "a", "bliss".
Data standardization: It involves converting all words in the message into a
standard form, e.g. converting all words to lower case [10]. Example: "The Market
is near Puneet's House" is converted to "the market is near puneet's house".
Emoji conversion: The emoticons present in the text messages are assigned a
keyword based on the expression they convey. There are two types of emoticons,
classified as follows:
Positive emoticons: emoticons which convey a positive sentiment; they are
replaced by positive words based on the symbol.
Negative emoticons: These emoticons reflect the sad or disturbed sentiments of the
subject and are thus replaced by negative words.
Stop-word removal: All the words in the message which do not convey a special
meaning, like "a", "the", "then", etc., are removed.
Stemming: The process of obtaining the root word corresponding to every word by
dropping suffixes like -ion, -ing, etc.
Abbreviation analysis: Replacing the abbreviations present in the message by their
full forms, e.g. FB by Facebook, GM by good morning, etc.
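The pre-processing steps above can be sketched as one small pipeline. This is a minimal stdlib sketch: the stop-word set, abbreviation map and suffix list are illustrative stand-ins, and the suffix-stripping "stemmer" is naive, not a real stemmer such as Porter's:

```python
STOP_WORDS = {"a", "an", "the", "is", "then"}
ABBREVIATIONS = {"fb": "facebook", "gm": "good morning"}
SUFFIXES = ("ing", "ion", "ed", "s")

def preprocess(message):
    tokens = message.lower().split()                      # standardization
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]    # abbreviation analysis
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    stemmed = []
    for t in tokens:                                      # naive suffix stripping
        for suf in SUFFIXES:
            if t.endswith(suf) and len(t) > len(suf) + 2:
                t = t[: -len(suf)]
                break
        stemmed.append(t)
    return stemmed

print(preprocess("Walking is a bliss"))  # ['walk', 'blis']
```

The output shows both the benefit (walking becomes walk) and the risk noted earlier: crude stripping can damage words (bliss becomes blis), which is why real systems use dictionary-backed stemmers or lemmatizers.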
• N-gram
The next step after data pre-processing is N-gram feature extraction. An N-gram is
a series of n tokens, and the N-gram model is very widely used in NLP tasks [3].
The model creates N-grams from the messages in the data set to extract keyword
features. For n = 3, a sequence of three words is generated for each message. The
N-gram process increases the efficiency and accuracy of the classification step
because features are extracted from three-token sequences. Example: "What is your
name" is analysed as "what is your", "is your name".
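The trigram generation just described (n = 3) can be sketched with a sliding window over the token list:

```python
def ngrams(text, n=3):
    """Return all sequences of n consecutive tokens from text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("What is your name"))  # ['what is your', 'is your name']
```

The same function gives unigrams or bigrams by changing `n`, so one implementation serves all the N-gram features the model needs.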
• Term Frequency
The number of times a token occurs in each data sample is called its term frequency.
Words which are present in high frequency are considered to have a better
relationship with the sample.
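Term frequency as described here is a simple count per sample, which the standard library's `collections.Counter` computes directly; the sample sentence is invented for illustration:

```python
from collections import Counter

def term_frequencies(tokens):
    """Count how many times each token occurs in one data sample."""
    return Counter(tokens)

tf = term_frequencies("good movie good story weak ending".split())
print(tf["good"])        # 2
print(tf.most_common(1))  # [('good', 2)]
```

In a full TF-IDF scheme these raw counts would then be weighted down for tokens that appear in many samples.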
• KNN Algorithm
The output obtained from the Support Vector Machine algorithm is clusters of two
sentiments with class labels "normal" and "critical". Based on this output, the
k-Nearest Neighbour algorithm is applied in order to deduce the overall sentiments
of the subject. The input for the k-Nearest Neighbour algorithm is the sentiments
associated with all the chats the subject is involved in. The final step is to predict
the sentiment of the person based on the collected feature set. The data is divided
into two sets, i.e. training and testing sets, and the k-Nearest Neighbour algorithm
is used to predict the sentiment of the processed text. k-Nearest Neighbour is a
method for classifying data based on the nearest training instances in the feature
space: the class label assigned is the majority class of the nearest k instances in the
training set. k-Nearest Neighbour is a type of lazy-learner strategy, and is
considered a flexible and simple classification technique based on machine learning
concepts.
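The k-Nearest Neighbour step can be sketched in pure Python over toy two-dimensional feature vectors. The feature values and the (positive-word count, negative-word count) encoding are invented for illustration; a real system would use the feature set described above:

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label) pairs. Assign the
    majority label among the k nearest training points, using
    Euclidean distance (the lazy-learner strategy: no training
    phase, all work happens at query time)."""
    dists = sorted((math.dist(vec, query), label) for vec, label in train)
    top = [label for _, label in dists[:k]]
    return Counter(top).most_common(1)[0][0]

# toy features: (positive-word count, negative-word count)
train = [((3, 0), "normal"), ((2, 1), "normal"),
         ((0, 4), "critical"), ((1, 3), "critical")]
print(knn_predict(train, (0, 3)))  # critical
```

`math.dist` requires Python 3.8+; for older versions the distance can be computed with `math.sqrt(sum((a - b) ** 2 for a, b in zip(vec, query)))`.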
CHAPTER 5
SYSTEM DESIGN
Fig. 5.1 A simple architecture of the recommender system
The block diagram of the system, shown below, describes the flow of the working
model of the system.
Fig. 5.2 Block diagram of process
Fig. 5.3 Classification of Sentiment Analysis
Fig. 5.4 System Flow Diagram Sentiment Analysis using Twitter API
CHAPTER 6
IMPLEMENTATION
The process of implementing information extraction using a data mining system
can be summarized in the following observations.
1) As we can see from this brief survey, the fields of text mining and information
extraction are rich with proven techniques and promising results.
2) They also offer new directions and the hope of leaps in efficiency, accuracy, and
usability. This trend will only strengthen as more and more knowledge floods the
Internet and users all over the globe strive for efficient extraction and
interpretation of information.
3) Information extraction using mining also helps to understand someone's
emotions, which might be used further.
4) Getting meaningful information from someone's text messages and analysing it
further gives an insight into that person's personality.
5) People suffering from mental disorders often hesitate to tell someone about their
problem, but our model helps them find a way out of mental trauma.
6) Due to today's busy lifestyles, many people do not even take care of themselves;
based on the information extracted from their messages, the model could suggest
how to bring positive change into their lifestyle.
7) Most IE systems are developed by training on human-annotated corpora.
However, constructing corpora sufficient for training accurate IE systems is a
burdensome chore.
8) The data access and integration service provides a web service which interacts
with the data sources. The search service communicates the user-entered search
criteria to the server.
9) The strict security-enforcement layers guard the resources and authorize access
based on user roles. The remaining users are used as training data (i.e., the set of
users to which the test users are compared for recommendations). The aim in
testing is to correctly recommend the withheld items from the test users' usage
patterns.
Fig. 6.1 System UI
Packages: There are a lot of Python packages available that can be used for
sentiment analysis. Some of the available packages are NumPy, Pandas, matplotlib,
wordcloud, etc. We can directly import these packages in our code while
implementing.
Some packages that are used in Sentiment analysis:
NumPy
Pandas
Wordcloud
TextBlob
Matplotlib
TensorFlow
Data Set Description: The data is obtained by extracting all the text messages sent
by the subject. This can be taken from Twitter: all the tweets posted on Twitter are
stored in a database, and we can analyze the sentiments there. The data set will
contain tweets in text format and emojis; data in any other format cannot be
analyzed.
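Tweets collected through the Twitter API are typically returned as JSON. A minimal stdlib sketch of loading such a response and pulling out the text fields follows; the sample payload and its field names are invented to mimic the general shape of a response, not copied from the API:

```python
import json

# A tiny invented sample mimicking the shape of a tweet listing;
# real API responses carry many more fields per tweet.
payload = '''[
  {"id": 1, "text": "Loving the new update!", "lang": "en"},
  {"id": 2, "text": "Worst service ever", "lang": "en"}
]'''

tweets = json.loads(payload)
texts = [t["text"] for t in tweets]
print(texts)  # ['Loving the new update!', 'Worst service ever']
```

In practice a client library would fetch the payload over HTTP with authentication; only the text list feeds into the cleaning and classification steps below.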
Text Cleaning: As the initial step of sentiment analysis, cleaning techniques are
applied to the data in order to reduce the dimensionality of and the noise in the text,
which also assists in improving the effectiveness of classification.
Classification of Sentiments: There are different approaches by which sentiment
analysis can be performed:
Machine Learning
Lexicon-Based
Hybrid
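Of these, the lexicon-based approach is the simplest to sketch: score a tweet by counting hits against hand-made positive and negative word lists. The word lists here are illustrative, far smaller than any real lexicon:

```python
POSITIVE = {"good", "great", "love", "happy", "bliss"}
NEGATIVE = {"bad", "hate", "sad", "worst", "angry"}

def classify(tweet):
    """Return 'positive', 'negative' or 'neutral' from word counts."""
    words = tweet.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify("I love this great phone"))  # positive
print(classify("worst day ever"))           # negative
```

The machine learning and hybrid approaches replace the fixed lists with models trained on labelled data (as the SVM and KNN steps described earlier do).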
CHAPTER 7
RESULT ANALYSIS
The result obtained from the proposed model is an estimated sentiment
prediction for the subject, based on the tweets posted from the user's account.
The output of the sentiment analysis can be used in many scenarios, such as
better marketing, brand management, and positive politics. The estimated
percentages of positive, negative, and neutral sentiment can flag "critical"
sentiment among the peers and society members of the subject, who can then act
accordingly. For example, the stress level and prevalence of mental disorders
in a community can be estimated, and in case of "critical" sentiment the
authorities can take action to discourage those who post content that destroys
the harmony and peace of mind of the subject or reader. In this way, sentiment
analysis models are a basic requirement for shaping society into a better
place.
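The percentage figures described above follow directly from the per-tweet labels; a standard-library sketch (the label names are assumed to match those used in this report):

```python
from collections import Counter

def sentiment_percentages(labels):
    """Return the share of each sentiment label as a percentage of all tweets."""
    counts = Counter(labels)
    total = len(labels)
    return {label: 100 * n / total for label, n in counts.items()}

labels = ["Positive", "Negative", "Positive", "Neutral", "Positive"]
print(sentiment_percentages(labels))
# {'Positive': 60.0, 'Negative': 20.0, 'Neutral': 20.0}
```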
Fig 7.2: Result Analysis of Narendra Modi's Twitter Account
CHAPTER 8
CONCLUSION, LIMITATIONS AND FUTURE SCOPE
8.1 CONCLUSION
The proposed model takes as input a data set created by accumulating all the
tweets posted by the user; the same approach can also be applied to messages
from other social media platforms such as Facebook and WhatsApp. These tweets
are pre-processed to obtain the key words from the data set. After
pre-processing we use probabilistic language models such as N-grams.
Associating weights with the data set using TF-IDF increases the overall
efficiency of the classifying algorithms. The next step is to use the
classifying algorithms to label the sentiment of each tweet as "positive",
"negative" or "neutral". [6] First a supervised algorithm, Support Vector
Machine, is used, as it proves to be highly efficient for such computations;
then the KNN algorithm is applied, which further increases the efficiency.
Thus, we propose a highly efficient method [7] of finding the sentiment of a
person by analysing text messages as well as emoticons. Emoticons [12] are very
common tokens in modern text messages, so we must also handle them
efficiently; we convert emoticons to textual form for our computations. This
makes the model a valuable tool in the modern world.
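The TF-IDF weighting mentioned above can be sketched in plain Python. This is a minimal illustration using the standard tf × idf formula with a smoothed logarithm; library implementations (e.g. scikit-learn's) differ in normalisation details:

```python
import math

def tf_idf(term, doc, corpus):
    """Weight of `term` in `doc`, given a corpus as a list of token lists."""
    tf = doc.count(term) / len(doc)              # term frequency in this document
    df = sum(1 for d in corpus if term in d)     # how many documents contain it
    idf = math.log(len(corpus) / (1 + df)) + 1   # smoothed inverse document frequency
    return tf * idf

corpus = [["good", "phone"], ["bad", "phone"], ["good", "good", "service"]]
print(round(tf_idf("good", corpus[2], corpus), 3))  # 0.667
```

Rare but locally frequent words receive high weights, which is why this scheme improves the classifiers downstream.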
8.2 LIMITATION
In practice, many commercial recommender systems are based on large datasets
extracted from different social media accounts. As a result, the user-item matrix is
used for collaborative filtering which could be very large and sparse, and it brings
some of the challenges in the performances of the recommendation system
performing sentiment analysis. Some of the problems are defined below such as the
problem caused by the sparsity of data which is the cold start problem. As we all
know that the human language always consist of some ambiguity and complexity
which is a giant hindrance to a successful understanding of a computer. Even with
complex and large grammars, there is always some issue with the way of people using
31
language freely. Some of the common issue which people perform now-a-days are
mentioned below:
a) Mostly, people do not adhere to rules of grammar while writing tweets.
b) Misspellings
c) Abbreviations
d) Use of slang
These are only some of the issue that we have mentioned. Apart from this also, there
are number of issue with the language used on the social media.
We can summarise some of these issues and the methods used to deal with them.
Text documents usually contain many words that are not necessary for
understanding the general idea of the text. High-frequency words such as:
a) 'a'
b) 'the'
c) 'of'
are generally known as stop words and can be ignored in many cases. In fields
like Information Retrieval this approach is used to reduce the dimensionality
of the term space and hence improve performance. It is less useful in text
mining, however, because these words can often carry information and clarify
semantics. For example, the statement "She got present" contains two potential
stop words, "She" and "got"; without "got" we interpret an entirely different
meaning, "She present". There is also a risk of losing the correct meaning of
the text if we consider the words only in isolation. For instance, in the
sentence "The person was playing 'Holi' and enjoyed a lot.", looking at the
individual words alone suggests that 'Holi' is a game, which it is not. Such
cases are considered limitations of sentiment analysis.
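A minimal stop-word filter over a small hand-picked list illustrates the idea (the list here is an assumed toy example; real systems use much longer lists such as NLTK's):

```python
STOP_WORDS = {"a", "an", "the", "of", "is", "in", "to"}  # tiny illustrative list

def remove_stop_words(text):
    """Drop high-frequency function words before indexing or mining."""
    return ' '.join(w for w in text.lower().split() if w not in STOP_WORDS)

print(remove_stop_words("The power of the opinions of a customer"))
# power opinions customer
```

As noted above, dropping a word like "got" from "She got present" changes the meaning, so such filtering must be applied with care in sentiment analysis.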
Stemming (or lemmatization) is also commonly used in IR. To group similar words
into one, we reduce words to their stem, or root form. For example, “walking”,
“walk”, “walked”, and “walker” will all be reduced to the root word “walk”.
To eliminate noisy data, we can use spelling correctors and acronym and abbreviation
expanders. These usually require a dictionary or thesaurus.
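Both ideas can be illustrated with a toy sketch: a naive suffix-stripping stemmer (far cruder than the Porter algorithm used in practice) and a dictionary-based abbreviation expander whose table is an assumption made for the example:

```python
ABBREVIATIONS = {"gr8": "great", "u": "you", "pls": "please"}  # assumed examples

def naive_stem(word):
    """Strip common inflectional suffixes (toy rule set, not Porter stemming)."""
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def expand(text):
    """Replace known abbreviations with their full forms."""
    return ' '.join(ABBREVIATIONS.get(w, w) for w in text.split())

print([naive_stem(w) for w in ["walking", "walked", "walker", "walk"]])
# ['walk', 'walk', 'walk', 'walk']
print(expand("u did gr8"))  # you did great
```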
8.3 FUTURE SCOPE
The proposed model can be used wherever sentiment analysis is required, for
many different purposes such as critic reviews of hotels, [8] movies, videos,
etc. Sentiment analysis methods have so far been used to detect the polarity
of the thoughts and opinions of users of social media. Businesses are very
interested in understanding what people think and how they respond to the
products and services around them. Companies use sentiment analysis to
evaluate their advertisement campaigns and to improve their products, and aim
to apply such tools in the areas of customer feedback, marketing, CRM, and
e-commerce.
REFERENCES
[1] May, R. M. 1997. The Scientific Wealth of Nations, Science, vol. 275, no. 5301,
pp. 793-796.
[4] Fano, R. M. 1956. Information theory and the retrieval of recorded information,
in Documentation in Action, Shera, J. H. Kent, A. Perry, J. W. (Edts), New York:
Reinhold Publ. Co., pp.238–244.
[5] Small, H. 1973. Co-citation in the scientific literature: a new measure of the
relationship between two documents, Journal of the American Society for
Information Science, vol. 24, pp. 265–269
[7] Garfield, E. and Welljams-Dorof, A. 1992. Citation data: their use as
quantitative indicators for science and technology evaluation and policy-making,
Science & Public Policy, vol. 19, no. 5, pp. 321-327.
[8] G. Lin, H. Zhu, X. Kang, C. Fan, and E. Zhang, “Feature structure fusion and its
[9] J. He and N. Xiong, ‘‘An Effective Information Detection Method for Social big
data,’’ in Multimedia Tools Appl., vol. 77, no. 9, pp. 11277–11305, 2018.
[10] H. Si, Z. Chen, W. Zhang, J. Wan, J. Zhang, and N. N. Xiong, ‘‘A member
recognition approach for specific organizations based on relationships among users
in social networking Twitter,’’ Future Gener. Comput. Syst., vol. 92, pp. 1009–1020,
Mar. 2019.
[11] Positive and Negative Emoticons
(https://images.app.goo.gl/Pr3JvNdcbaVAiZB27)
APPENDIX
CODE:
# Required imports (assumed to be installed in the environment).
import re
import pandas as pd
import tweepy
from textblob import TextBlob
import matplotlib.pyplot as plt

# API credentials are read from a configuration table loaded elsewhere.
twitterApiKey = config['twitterApiKey'][0]
twitterApiSecret = config['twitterApiSecret'][0]
twitterApiAccessToken = config['twitterApiAccessToken'][0]
twitterApiAccessTokenSecret = config['twitterApiAccessTokenSecret'][0]
# Authenticate and fetch the 50 most recent tweets from the account's
# timeline (standard tweepy pattern).
auth = tweepy.OAuthHandler(twitterApiKey, twitterApiSecret)
auth.set_access_token(twitterApiAccessToken, twitterApiAccessTokenSecret)
twitterApi = tweepy.API(auth, wait_on_rate_limit=True)
tweets = tweepy.Cursor(twitterApi.user_timeline,
                       exclude_replies=True,
                       contributor_details=False,
                       include_entities=False
                       ).items(50)
df = pd.DataFrame(data=[tweet.text for tweet in tweets], columns=['Tweet'])
df.head()
def cleanUpTweet(txt):
    # Remove mentions: they start with '@' followed by a Twitter id.
    txt = re.sub(r'@[A-Za-z0-9_]+', '', txt)
    # Remove hashtag symbols: hashtags start with '#'.
    txt = re.sub(r'#', '', txt)
    # Remove retweet markers: retweeted text starts with 'RT'.
    txt = re.sub(r'RT[\s]+', '', txt)
    # Remove URLs: they start with http or https.
    txt = re.sub(r'https?:\/\/[A-Za-z0-9\.\/]+', '', txt)
    return txt
df['Tweet'] = df['Tweet'].apply(cleanUpTweet)
def getTextSubjectivity(txt):
    return TextBlob(txt).sentiment.subjectivity

def getTextPolarity(txt):
    return TextBlob(txt).sentiment.polarity
df['Subjectivity'] = df['Tweet'].apply(getTextSubjectivity)
df['Polarity'] = df['Tweet'].apply(getTextPolarity)
df.head(50)
df = df.drop(df[df['Tweet'] == ''].index)
df.head(50)
def getTextAnalysis(a):
    if a < 0:
        return "Negative"
    elif a == 0:
        return "Neutral"
    else:
        return "Positive"
df['Score'] = df['Polarity'].apply(getTextAnalysis)
df.head(50)
positive = df[df['Score'] == 'Positive']
# First part of the output: the percentage of positive tweets, printed
# numerically (the counts are also shown in the bar graph below).
print(str(positive.shape[0] / df.shape[0] * 100) + " % of positive tweets")
negative = df[df['Score'] == 'Negative']
# Percentage of negative tweets.
print(str(negative.shape[0] / df.shape[0] * 100) + " % of negative tweets")
objective = df[df['Subjectivity'] == 0]
# Percentage of objective tweets (subjectivity score of 0).
print(str(objective.shape[0] / df.shape[0] * 100) + " % of objective tweets")
labels = df.groupby('Score').count().index.values
values = df.groupby('Score').size().values
plt.bar(labels, values)
# add legend
plt.show()
# Of the many visualization options, this project uses a word cloud, which
# sizes each word by its frequency: the more often a word is used, the larger
# it appears in the word cloud.
OUTPUT: