Sentiment Analysis on Tweets using Machine Learning

Richa Garg¹, Dr. Ochin Sharma²
¹Research Scholar, ²Assistant Professor
Manav Rachna International Institute of Research and Studies, Faridabad, India
Abstract- The approach used to predict the polarity of any content as positive, negative or neutral is called sentiment analysis. This research work designs a new mechanism for this task using a previous study as its base. Classification and feature extraction techniques are combined to design the proposed technique. To perform feature extraction, the n-gram algorithm is applied. Further, the input data is categorized as positive, negative or neutral using a KNN classifier. Performance parameters such as recall, accuracy and precision are calculated to validate the proposed system. The experimental results show that the proposed approach performs better than the existing approach that uses an SVM classifier.

Keywords- SVM, KNN, Naïve Bayes, Fake Profile Detection

I. INTRODUCTION
Data analytics is a study through which important knowledge is extracted from raw data by breaking it down. This process can help in understanding the actual scenario of the user’s work and in making better decisions. The data analytic process includes actions such as cleansing, inspection, modeling and transformation, which collectively help in discovering important information present within the data [1]. In the data examination process, several facets and approaches are designed, and numerous techniques have been developed under separate names in separate domains. Data mining is a particular data investigation method proposed to extract important information such that it can be used in predictive forms [2]. Business intelligence, however, is the analysis performed on the basis of aggregations that depend entirely on business information. Investigation is the process in which multiple components are generated from one complete document for proper investigative study [3]. To ensure that the data can be used when making decisions, the raw data is converted into a useful form. Collecting data from numerous sources and then analyzing it helps in answering several questions, such that a hypothesis can be tested or a theory can be disproved. A procedure is followed for data investigation, and different techniques are then executed for result interpretation and for proper data arrangement. Thus, a highly precise data analysis can be performed when proper measurements are applied.

There are three major categories into which data is divided in big data analysis. Structured data is stored in the rows and columns of a table in an SQL database [4]. It is highly organized and has a relational key, which makes mapping to pre-designed fields very easy. Semi-structured data has some organizational properties even though it is not held in a relational database, which makes this kind of data easier to analyze. Organizing this data gives it a specific organizational format [5]. Unstructured data makes up around 80% of all existing data. It has no particular structure and mainly includes text and multimedia content.

Natural Language Processing (NLP) is an application of computational linguistics through which text is interpreted and analyzed. It is the area of Computer Science and Artificial Intelligence concerned with the interaction between computers and human natural language. To provide an appropriate review of a product, the complete information and opinions about the product are collected and categorized. For the analysis of the opinions of individual users, several refinements of the collected data are performed [6]. Users can post their opinions through blog posts and various social networking platforms. Social network sites, among which Google, Instagram, Twitter and Facebook are a few popular examples, have changed the manner in which users express their views and opinions. All the reviews of clients related to products and services can be obtained here. Textual information retrieval techniques process, search and analyze the available data. Developing new applications becomes challenging since blogs and social sites contain a huge amount of data [7].

Microblogging and social networking sites such as Twitter are used for posting ongoing messages as tweets. The unique properties of tweets create new challenges and shape the domains and methods involved in sentiment analysis. Several kinds of approaches are used to perform text classification, which in turn supports Twitter sentiment classification. Lexicographical resources are used in these approaches. In the initial method, a set of sentiment words and their orientations is expanded by collecting their synonyms and antonyms [8]. The machine learning based approach is used to solve various problems related to sentence classification: a text classifier is trained on a human-labeled training dataset. Two commonly known approaches are used here, namely supervised and


unsupervised learning approaches. A hybrid approach combines elements of lexicon-based and machine learning techniques to perform sentiment analysis. To determine semantics in a more sophisticated manner, these approaches use semantic networks and ontologies. Various kinds of classifiers are used to perform text classification [9], and they help in performing Twitter sentiment classification efficiently. Naïve Bayes is a supervised machine learning approach based on Bayes' theorem. Even though the class probabilities it estimates may be low, the Naïve Bayes algorithm achieves efficient results. The SVM classifier achieves a large margin when performing classification: a hyperplane is used to separate the tweets based on the distance between each tweet and the hyperplane.

II. LITERATURE REVIEW
Rupal Bhargava, et al. (2017) proposed an approach that used different machine learning techniques for analyzing text. In the system, machine translation was used to deal with the different features of different languages [10]. To find sentiments within the text, the text was processed after the machine translation step. A substantial amount of text became available on the internet with the introduction of blogs, forums and online reviews. Extracting the important text from this substantial volume proved helpful in reducing processing. Therefore, in this study, a text summarization process was used to extract significant portions of text. These portions were then used to analyze sentiments about the specific subject and its aspects. The test results demonstrated that the proposed approach performed well.

Archana N. Gulati, et al. (2017) stated that a text summary is a reduction of the original text, produced by choosing the important text within the source. In the last few years, with the growth of the World Wide Web, a large volume of information has been generated and presented through the internet [11]. Text summarization was needed to get the essence of a specific topic from several sources of information existing online. In this study, a novel extractive text summarization method was proposed for multiple documents. The proposed approach achieved an accuracy of 73% over numerous Hindi documents. The system-generated summary was very close to the human-made summary. The accuracy of the system-generated summary was reported in terms of Precision, Recall and F-score values.

Akshi Kumar, et al. (2017) compared three keyword extraction techniques. These approaches were used in automatic text summarization systems on the basis of text representation and summary generation factors [12]. The approaches were compared on a common dataset of manually generated articles. The algorithms were compared on the basis of time and of their efficiency in keyword extraction. The outcomes were compared with the scores of manually created summaries, which were used as the benchmark. The algorithms reached scores comparable to those of the human summaries. In future, more attention will be given to enhancing summary accuracy by using these algorithms together with machine learning and swarm based techniques.

Shahnawaz, et al. (2017) stated that sentiment analysis is used to identify the opinions or beliefs expressed in opinionated information in order to recognize the feelings of the writer with respect to a specific topic [13]. The interest of the technical community and the business world in gathering, processing and extracting information from public reviews presented on different social media platforms was increasing gradually. Inability to perform well in different domains and insufficient accuracy were the major issues of existing methods. These issues arose due to inadequate labeled information and the inability to handle complex sentences that required more than sentiment words and simple analysis. It was identified that semi-supervised and unsupervised learning based models could be used to solve the issue of labeled information scarcity.

N. Moratanch, et al. (2017) stated that text summarization systems comprise several methods, which are categorized as extractive and abstractive [14]. In this study, a wide-ranging review of extractive text summarization methods was presented by classifying these methods into supervised and unsupervised learning approaches. The benefits of these approaches were presented on the basis of the different techniques. A number of evaluation techniques, challenges and future research directions were included in this review as well.

Manisha Gupta, et al. (2016) proposed a new technique for the summarization of Hindi text documents on the basis of a number of linguistic rules [15]. To reduce the number of words relative to the original text, dead wood words and phrases were also removed from the original document. The proposed approach was evaluated on different Hindi inputs, and the precision of the system was measured in terms of the number of lines retrieved from the original text that carry significant information of the original document. It was identified that the proposed approach reduced the text size by 60% to 70%. It was also noted that the system generated an extractive summary of the input provided by the user, which means that it did not produce the summary on the basis of text semantics.


III. RESEARCH METHODOLOGY
Figure 1 shows the architecture of the proposed system, which is based on the N-gram and KNN classification model.

A. Dataset
Two sets of data samples are produced manually. One is used for training and the other for testing. The training sample presents an X:Y association, where X represents the score of a possible opinion word and Y indicates its positive or negative label. The testing set is produced from remarks collected from different social media sites. Each remark in the test sample is tagged manually as positive or negative. After training is complete, the reviews are separated into positive and negative opinions. The method is tested with reviews from the test sample, which were gathered earlier with known polarity. The accuracy of the classification is determined on the basis of the outcomes produced by the system.

B. Data Preprocessing
Three preprocessing methods are implemented in this research work: stemming, error correction and stop word removal. The essential job of stemming is to detect the root of a word. Its main objective is to remove suffixes and so reduce the number of words involved. With this approach, the time and memory consumption of the system can be reduced to a large extent. Error correction is necessary because not all reviewers follow the same grammatical rules, punctuation and spelling. Such errors may cause the meaning to be understood differently, and therefore some kind of correction is needed. Stop words are removed to minimize the complexity of the text. However, removing some words such as “it” may affect the core meaning of a sentence, and this must be avoided.

C. Lexical Analysis of Sentences
A sentence that contains either a positive or a negative opinion is known as a subjective sentence. Questions or phrases written by users that do not carry any sentiment are called objective sentences. These sentences can be removed to reduce the overall size of the review. A question is mostly formed with words such as “where” and “who”, and these words do not express any opinion. Such phrases are removed from the records as well. The standard word lists available in Python do not distinguish these questions.

D. Extraction of Features
The main problem in sentiment analysis arises during the extraction of features from the data sample. A noun is always used to represent a feature of a product. POS tagging is used to identify and extract all the nouns in order to recognize all features. Extremely rare features need to be removed. After the rarely occurring features are eliminated, a list of frequently occurring features is obtained. The N-gram approach is used for feature extraction and for the POS tagging of phrases.

E. Define Positive, Negative and Neutral Words
The words describing a particular feature can be extracted with the help of the Stanford parser. The parser collects the grammatical dependencies between the words in a sentence and gives them as output. These dependencies are used in the next stages to identify the opinion words for the features gathered in the previous stage. A direct dependency is treated as a direct identification of the opinion word for a specific feature. Transitive dependencies also need to be considered together with direct dependencies at this stage.

F. SentiWordNet
SentiWordNet is built specifically for opinion mining applications. Three related polarity scores exist for every word in SentiWordNet: positivity, negativity and objectivity. For example, within SentiWordNet, 125 is the overall score for the word ‘high’. However, the word “high” may not be positive in some sentences, such as “cost is high”, which in fact expresses a negative opinion. Such situations are therefore also handled carefully.

G. K-Nearest Neighbor Classifier
The KNN classifier is chosen for this approach because sentiment analysis is a binary classification and a large number of data samples are available. A manually generated training set is used to train the classification model. The training set provides an X:Y relation, where X represents the score of an opinion word and Y indicates whether the word is positive or negative, and scores are assigned accordingly [15]. The score of the opinion word relevant to the feature in the review is given as input to the KNN classification model.

H. Extraction of Feature Wise Opinion
To extract the opinion associated with a specific feature, all remarks that mention the feature must be used. The positive score of a feature is computed as the ratio of the remarks containing positive opinions about it to all existing remarks. Similarly, the negative score of a feature is the ratio of the number of reviews expressing a negative sentiment about the feature to the total number of reviews present.
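
The feature-wise scoring described in Section H can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `feature_opinion_scores`, the input layout and the sample reviews are hypothetical, and the sketch assumes that the denominator is the number of reviews mentioning the feature, which is one possible reading of the ratio described above.

```python
from collections import defaultdict


def feature_opinion_scores(reviews):
    """Return positive/negative ratios per feature.

    `reviews` is a list of (features, polarity) pairs, where `features`
    lists the product features mentioned in one review and `polarity`
    is either "positive" or "negative".
    Assumption: ratios are taken over the reviews that mention the feature.
    """
    counts = defaultdict(lambda: {"positive": 0, "negative": 0})
    for features, polarity in reviews:
        for feature in set(features):  # count each review once per feature
            counts[feature][polarity] += 1

    scores = {}
    for feature, c in counts.items():
        total = c["positive"] + c["negative"]
        scores[feature] = {
            "positive": c["positive"] / total,
            "negative": c["negative"] / total,
        }
    return scores


# Hypothetical labelled reviews, for illustration only.
reviews = [
    (["battery", "screen"], "positive"),
    (["battery"], "negative"),
    (["battery"], "positive"),
    (["screen"], "positive"),
]
scores = feature_opinion_scores(reviews)
```

With these sample reviews, two of the three reviews that mention "battery" are positive, so its positive score is 2/3 and its negative score is 1/3.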

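The combination of N-gram feature extraction (Section D) and KNN classification (Section G) can be sketched in plain Python as follows. This is only an illustrative sketch under several assumptions that do not come from the paper: the stop word list, the use of unigrams plus bigrams, cosine similarity as the neighbour measure, k = 3, and the tiny labelled sample are all hypothetical. Stemming and error correction are omitted for brevity.

```python
import math
from collections import Counter

# Illustrative stop word list (an assumption, not the list used in the paper).
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to"}


def preprocess(text):
    """Lowercase, strip surrounding punctuation and drop stop words."""
    tokens = [t.strip(".,!?\"'").lower() for t in text.split()]
    return [t for t in tokens if t and t not in STOP_WORDS]


def ngram_features(tokens, n_max=2):
    """Count all n-grams of length 1..n_max in the token list."""
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats


def cosine_similarity(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, text, k=3):
    """Majority vote over the k training samples most similar to `text`."""
    query = ngram_features(preprocess(text))
    ranked = sorted(train,
                    key=lambda item: cosine_similarity(query, item[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical manually labelled training remarks.
labelled = [
    ("great phone, love the battery", "positive"),
    ("love this great camera", "positive"),
    ("good battery and great screen", "positive"),
    ("terrible battery, hate it", "negative"),
    ("bad screen and terrible service", "negative"),
    ("hate the bad camera", "negative"),
]
train = [(ngram_features(preprocess(text)), label) for text, label in labelled]
prediction = knn_predict(train, "love the great battery")
```

A library classifier could replace the hand-written vote; the pure-Python form is used here only to keep the sketch self-contained.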
Fig.1: Proposed System Architecture

The proposed research work is implemented in Python, and the results are evaluated by comparing the proposed and existing techniques in terms of different performance parameters.

Fig.2: Accuracy Comparison (y-axis: accuracy percentage)

As shown in figure 2, the accuracy of the three classifiers, SVM, KNN and Naïve Bayes, is compared for performance analysis.

Fig.3: Precision-Recall Comparison (y-axis: value; series: Precision and Recall; x-axis: SVM, KNN, Naïve Bayes)

As shown in figure 3, the precision and recall of the three classifiers, SVM, KNN and Naïve Bayes, are compared for performance analysis.
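
For reference, the performance parameters used in this comparison (accuracy, precision and recall) can be computed from true and predicted labels as in the sketch below. The function name `evaluate` and the label lists are hypothetical and serve only to illustrate the standard definitions; they are not the experimental data of this work.

```python
def evaluate(y_true, y_pred, positive="positive"):
    """Compute accuracy, precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    correct = sum(t == p for t, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical labels, for illustration only.
y_true = ["positive", "positive", "negative", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "positive", "positive"]
metrics = evaluate(y_true, y_pred)
```

For these sample labels, three of five predictions are correct (accuracy 0.6), and both precision and recall for the positive class are 2/3.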


Fig.4: Execution Time Comparison (y-axis: execution time; x-axis: SVM, KNN, Naïve Bayes)

As shown in figure 4, the execution time of the three classifiers, SVM, KNN and Naïve Bayes, is compared for performance analysis.

V. CONCLUSION
Sentiment analysis deals with the emotions and attitudes of individuals toward certain events. Opinion mining is commonly used in applications such as reviewing products, analyzing social media content or reviewing movies. This research combines the n-gram technique and the KNN classifier to perform sentiment analysis. Several techniques have been designed for sentiment analysis over the past years. The previously designed technique, which used an SVM classifier to classify tweets as positive, negative and neutral, was used as the base for the new approach. In the new approach, features are extracted using N-gram and the data is categorized by polarity using the KNN classifier. The proposed and existing approaches are compared with each other for performance evaluation. An improvement of around 7% in sentiment analysis is achieved by implementing the proposed approach.

[1]. Tharindu Weerasooriya, Nandula Perera, S.R. Liyanage. A method to extract essential keywords from tweet using NLP. 2016 16th International Conference on Advances in ICT for Emerging Regions (ICTer).
[2]. Ibrahim A. Hameed. Using Natural language processing for designing socially intelligent robots. 2016 Joint International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob).
[3]. L. Suanmali, M. S. Binwahlan, and N. Salim. Sentence features fusion for text summarization using fuzzy logic, in Hybrid Intelligent Systems, 2009, HIS'09, Ninth International Conference on, vol. 1, IEEE, 2009, pp. 142-146.
[4]. L. Suanmali, N. Salim, and M. S. Binwahlan. Fuzzy logic based method for improving text summarization. arXiv preprint arXiv:0906.4690, 2009.
[5]. X. W. Meng Wang and C. Xu. An approach to concept oriented text summarization, Proceedings of ISCIT'05, IEEE international conference, China, pp. 1290-1293, 2005.
[6]. M. G. Ozsoy, F. N. Alpaslan, and I. Cicekli. Text summarization using latent semantic analysis. Journal of Information Science, vol. 37, no. 4, pp. 405-417, 2011.
[7]. Adyan Marendra Ramadhani, Hong Soon Goo. Twitter Sentiment Analysis using Deep Learning Methods. 7th International Annual Engineering Seminar (InAES), Yogyakarta, Indonesia, 2017.
[8]. K. Kaviya, C. Roshini, V. Vaidhehi, J. Dhalia Sweetlin. Sentiment for Restaurant Rating. 2017 IEEE International Conference on Smart Technologies and Management for Computing, Controls, Energy and Material (ICSTM).
[9]. Devika M D, Sunitha C, Amal Ganesh. Sentiment Analysis: A Comparative Study on Different Approaches. Procedia Computer Science, vol. 87, pp. 44-49, 2016.
[10]. Rupal Bhargava and Yashvardhan Sharma. MSATS: Multilingual Sentiment Analysis via Text Summarization, IEEE, vol. 9, iss. 8, pp. 97-110, 2017.
[11]. Archana N. Gulati, Dr. S.D. Sawarkar. A novel technique for multi-document Hindi text summarization. 2017 International Conference on Nascent Technologies in the Engineering Field (ICNTE-2017), vol. 8, pp. 1-4, 2017.
[12]. Akshi Kumar, Aditi Sharma, Sidhant Sharma, Shashwat Kashyap. Performance Analysis of Keyword Extraction Algorithms Assessing Extractive Text Summarization. International Conference on Computer, Communication, and Electronics (Comptelix), 2017.
[13]. Shahnawaz, Parmanand Astya. Sentiment Analysis: Approaches and Open Issues. International Conference on Computing, Communication and Automation, vol. 9, pp. 1-5, 2017.
[14]. N. Moratanch, S. Chitrakala. A Survey on Extractive Text Summarization. IEEE International Conference on Computer, Communication and Signal Processing (ICCCSP), vol. 8, pp. 1-4, 2017.
[15]. Manisha Gupta, Dr. Naresh Kumar Garg. Text Summarization of Hindi Documents using Rule Based Approach. International Conference on Micro-Electronics and Telecommunication Engineering, vol. 8, pp. 1-4, 2016.


