
Minor New Report


SENTIMENTAL ANALYSIS

A
MINOR PROJECT REPORT
Submitted for the partial fulfillment of the requirement for the award of Degree
B.Tech.
IN
COMPUTER SCIENCE & ENGINEERING

Submitted By:
Aditi Agrawal - 0101CS201005
Kavery Pandey - 0101CS201059
Kshitij Yadav - 0101CS201063

Guided By:
Dr. Shikha Agrawal
Prof. Manish Mishra

UNIVERSITY INSTITUTE OF TECHNOLOGY
RAJIV GANDHI PROUDYOGIKI VISHWAVIDYALAYA, BHOPAL-462033
SESSION 2020-2024
RAJIV GANDHI PROUDYOGIKI VISHWAVIDYALAYA, BHOPAL

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

CERTIFICATE

This is to certify that Aditi Agrawal, Kavery Pandey, Kshitij Yadav of B.Tech
Third Year, Computer Science & Engineering have completed their Minor
Project entitled "Sentimental Analysis using NLP" during the year 2023 under
our guidance and supervision.

We approve the project for the submission for the partial fulfillment of the
requirement for the award of degree of B.E. in Computer Science &
Engineering.

Dr. Shikha Agrawal
Assoc. Professor, DoCSE
Project Guide

Prof. Manish Mishra
Assistant Professor, DoCSE
Project Guide

Prof. Uday Chaurasia
Head of Department,

DECLARATION BY CANDIDATE

We hereby declare that the work presented in the minor project entitled
"Sentimental Analysis using NLP", submitted in partial fulfillment of the
requirement for the award of Bachelor degree in Computer Science and
Engineering, has been carried out at University Institute of Technology RGPV,
Bhopal, and is an authentic record of our work carried out under the guidance of
Dr. Shikha Agrawal (Project Guide) and Prof. Manish Mishra (Project
Guide), Department of Computer Science and Engineering, UIT RGPV, Bhopal.

The matter in this project has not been submitted by us for the award of any
other degree.

Aditi Agrawal – 0101CS201005

Kavery Pandey -0101CS201059

Kshitij Yadav - 0101CS201063


ACKNOWLEDGEMENT

On completing this minor project, words are not enough to express our feelings
towards all those who helped us reach our goal. Above all, we are indebted to
the Almighty for giving us this moment in life.

First and foremost, we take this opportunity to express our deep regards and
heartfelt gratitude to our project guides, Dr. Shikha Agrawal and Prof.
Manish Mishra of the Computer Science and Engineering Department,
RGPV Bhopal, for their inspiring guidance and timely suggestions in carrying
out our project successfully. They have also been a constant source of inspiration
for us.

We are extremely thankful to Prof. Uday Chourasia, Head, Computer
Science and Engineering Department, RGPV Bhopal, and Prof. Manish
Ahirwar, minor project coordinator, DoCSE, for their cooperation and
motivation during the project. We would also like to thank all the teachers of our
department for providing invaluable support and motivation. We are also
grateful to our friends and colleagues for their help and cooperation throughout
this work.

Aditi Agrawal - 0101CS201005

Kavery Pandey - 0101CS201059

Kshitij Yadav - 0101CS201063


Abstract

Sentiment analysis, also known as opinion mining, is a technique used to extract
and classify the sentiment expressed in textual data. In this project, we aim to
build a sentiment analysis model to analyze customer sentiment towards a
particular product or service. The project involves collecting and preparing data
from customer feedback forums, performing exploratory data analysis to gain
insights into the data, and creating features for input into the sentiment analysis
model. We train and evaluate the model using traditional machine learning
algorithms and deep learning techniques, such as Naive Bayes, logistic
regression, Support Vector Machines, Recurrent Neural Networks, and
Convolutional Neural Networks. We deploy the model into a production
environment and monitor its performance over time. Our results show that the
sentiment analysis model can accurately classify customer feedback into
positive, negative, and neutral categories, with an accuracy of over 80%. The
model can be used to provide valuable insights to businesses to improve their
products and services and enhance customer satisfaction. The sentiment analysis
project aims to analyze and classify the sentiment expressed in textual data, such
as customer reviews, social media posts, or feedback comments. The goal is to
determine whether the sentiment conveyed in the text is positive, negative, or
neutral.

This project can provide valuable insights for businesses, brands, and
researchers to understand public opinion, customer satisfaction, and trends.

Table of Contents

CERTIFICATE
DECLARATION
ACKNOWLEDGEMENT
ABSTRACT

1. Introduction
1.1 Basics
1.2 Need of sentiment analysis
1.3 Use Cases
1.3.1 Social Media Monitoring for Brand Management
1.3.2 Product/Service Analysis
1.3.3 Stock Price Prediction
1.4 Working of system
1.5 Approach

2. Objectives of sentimental Analysis System
2.1 Sentiment Classification
2.2 Accuracy and Performance
2.3 Data Collection and Preprocessing
2.4 Model Training and Evaluation
2.5 Domain-Specific Sentiment Analysis
2.6 Real-Time Analysis

3. Literature Survey

4. Problem Description
4.1 Large Volume of Textual Data
4.2 Customer Feedback and Reviews
4.3 Brand Reputation Management
4.4 Customer Support and Service
4.5 Public Opinion and Political Analysis
4.6 Market Research and Competitive Analysis
4.7 Social Media Monitoring
4.8 Subjectivity and Ambiguity in Text
4.9 Real-Time Analysis

5. Proposed Work
5.1 Data Collection
5.2 Text Preprocessing
5.3 Feature Extraction
5.4 Sentiment Classification
5.5 Model Training and Evaluation
5.6 Sentiment Prediction

6. Implementation
6.1 Description
6.2 Raw Data
6.3 Split into Train/Test
6.4 Data exploration
6.5 Correlations
6.6 Sentimental Analysis
6.6.1 Bag of Words Model
6.6.2 Multinomial Naive Bayes

7. Result and Analysis

8. Conclusion and future Scope


List of Figures

S No  Description

1   Sentiment Analyzer
2   Dataset used in project
3   Descriptive statistics
4   Visual representation
5   Bar graph based on sales frequency
6   Average rating of each ASINS
7   Average recommend rating of ASINS
8   Correlations for attributes
9   Dot Graph of ratings
10  Classifier code

Chapter 1
Introduction

1.1 Sentimental Analysis


The sentiment analysis project aims to analyze and classify the sentiment expressed in
textual data, such as customer reviews, social media posts, or feedback comments. The
goal is to determine whether the sentiment conveyed in the text is positive, negative, or
neutral. This project can provide valuable insights for businesses, brands, and researchers
to understand public opinion, customer satisfaction, and trends.

The aim is to develop an accurate and robust sentiment analysis system that can effectively
classify sentiment in textual data, enabling businesses to gain insights, make informed
decisions, and take appropriate actions based on public sentiment and customer feedback.
We use various natural language processing (NLP) and text analysis tools to identify
subjective information in the text. We then extract and quantify these details so that the
data is easier to classify and work with.

1.2 Need of Sentimental Analysis


Sentiment analysis is fundamental to how companies deal with customers on online portals
and websites. They use it routinely to classify a comment as a query, complaint,
suggestion, opinion, or simply praise for a product. This lets them sort through comments
and questions, prioritize what to handle first, and even order them in a way that looks
better. Companies sometimes even try to delete content that has a negative sentiment
attached to it.

It is also an easy way to understand and analyze public reception and perception of ideas
and concepts, such as a newly launched product, an event, or a government policy. Emotion
understanding and sentiment analysis play a large role in recommendation systems based on
collaborative filtering, which group together people who have similar reactions to a
product and show them related products. An example is recommending movies to people by
grouping them with others who have similar perceptions of a certain show or movie. Lastly,
sentiment analysis is also used for spam filtering and removing unwanted content.

1.3 Use cases

1.3.1 Social Media Monitoring for Brand Management:


Brands can use sentiment analysis to gauge their public image. For example, a company
can gather all Tweets that mention or tag the company and perform sentiment analysis
on them to learn how the public views it.

1.3.2 Product/Service Analysis:


Brands/Organizations can perform sentiment analysis on customer reviews to see how
well a product or service is doing in the market and make future decisions accordingly.

1.3.3 Stock Price Prediction :


Predicting whether the stocks of a company will go up or down is crucial for investors.
One way to estimate this is to perform sentiment analysis on the news headlines of
articles containing the company's name. If the headlines about a particular organization
carry a positive sentiment, its stock price is expected to rise, and vice versa. This is
an indication rather than a guarantee, since many other factors move prices.

1.4 Working of Sentimental Analysis System


Natural language processing (NLP) is the foundation on which sentiment analysis is built.
NLP is the broader field: a branch of AI that deals with text and gives machines the ability
to understand and derive meaning from it. Sentiment analysis is one task within that field.
NLP covers tasks such as virtual assistants, query solving, creating and maintaining
human-like conversations, summarizing texts, spam detection, and sentiment analysis. It
includes everything from counting the number of words to a machine writing a story that is
indistinguishable from human writing.
Sentiment analysis can be classified into categories based on various criteria.
Depending on the scope, it can be classified into document-level, sentence-level, and
sub-sentence-level (or phrase-level) sentiment analysis.

Also, a very common classification is based on what needs to be done with the data or the
reason for sentiment analysis. Examples of which are:

● Simple classification of text into positive, negative, or neutral. This may be extended
to fine-grained labels such as very positive or moderately positive.

● Aspect-based sentiment analysis, where we identify the sentiment together with the
specific aspect it relates to. For example, identifying sentiment about the various
parts of a car in user reviews, and which feature or aspect was appreciated or
disliked.

● Sentiment together with an associated action. For example, for mails written to
customer support, understanding whether a mail is a query, complaint, suggestion, etc.

1.5 Approach

Based on what needs to be done and what kind of data we need to work with there are two
major methods of tackling this problem.

● Rule-based sentiment analysis: There is a predefined list of words for each type of
sentiment, and the text or document is matched against these lists. The algorithm then
determines which type of words, and therefore which sentiment, is more prevalent in it.
This approach is easy to implement, but it lacks flexibility and does not account for
context.

● Automatic sentiment analysis: These methods are mostly based on supervised machine
learning algorithms and are useful for understanding complicated texts. Algorithms in
this category include support vector machines, logistic regression, and recurrent
neural networks (RNNs) and their variants.
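
The rule-based approach described above can be sketched in a few lines of Python. This is only a minimal illustration: the two word lists and the function name rule_based_sentiment are hypothetical and are not part of the project's code.

```python
import re

# Hypothetical word lists for illustration; a real lexicon would be far larger.
POSITIVE_WORDS = {"good", "great", "love", "fast", "lightweight"}
NEGATIVE_WORDS = {"bad", "poor", "hate", "slow", "heavy"}


def rule_based_sentiment(text: str) -> str:
    """Label text by counting matches against the two word lists."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positive_hits = sum(token in POSITIVE_WORDS for token in tokens)
    negative_hits = sum(token in NEGATIVE_WORDS for token in tokens)
    if positive_hits > negative_hits:
        return "positive"
    if negative_hits > positive_hits:
        return "negative"
    return "neutral"


print(rule_based_sentiment("Great screen and fast delivery"))
print(rule_based_sentiment("This is not good"))
```

The second call returns "positive" even though the sentence is negative, because the word list matches "good" and ignores "not". This is the lack of context mentioned above.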
Chapter 2
Objectives of this project

2.1. Sentiment Classification:


The primary objective of sentiment analysis is to accurately classify text into different sentiment
categories, such as positive, negative, or neutral. The project aims to develop a model or system
that can effectively determine the sentiment expressed in text data.

2.2. Accuracy and Performance:


A key objective is to achieve high accuracy in sentiment classification. The project strives to
build a model or system that can make precise predictions and perform well on unseen data. It
involves training and fine-tuning the model to improve its performance metrics, such as
accuracy, precision, recall, or F1 score.
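
As a small worked example of these metrics, the sketch below computes them by hand for a binary case. The function name binary_metrics and the sample labels are assumptions made for illustration only; they are not results from this project.

```python
def binary_metrics(y_true, y_pred, positive_label="positive"):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive_label and p == positive_label for t, p in pairs)
    fp = sum(t != positive_label and p == positive_label for t, p in pairs)
    fn = sum(t == positive_label and p != positive_label for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels, only to show the arithmetic.
y_true = ["positive", "positive", "positive", "negative", "negative"]
y_pred = ["positive", "positive", "negative", "positive", "negative"]
print(binary_metrics(y_true, y_pred))
```

Here 3 of 5 predictions are correct (accuracy 0.6), and there are 2 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 are all 2/3.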

2.3. Data Collection and Preprocessing:


Another objective is to gather relevant text data from various sources and preprocess it
effectively. The project involves designing or utilizing mechanisms to collect data from social
media, customer reviews, news articles, or other text-based sources. Preprocessing techniques,
such as tokenization, stop word removal, and normalization, are applied to clean and prepare the
data for analysis.

2.4. Model Training and Evaluation:


The project aims to train a sentiment analysis model using appropriate machine learning or deep
learning techniques. It involves selecting suitable algorithms, feature extraction methods, and
model architectures. The model is evaluated using appropriate evaluation metrics to assess its
performance and effectiveness in sentiment classification.

2.5. Domain-Specific Sentiment Analysis:


In some cases, the objective is to develop a sentiment analysis system specifically tailored to a
particular domain or industry. For example, sentiment analysis in the financial sector may focus
on analyzing sentiment in stock market news or investor sentiment. The objective is to create a
model that can understand and analyze sentiment in domain-specific language or contexts.

2.6. Real-Time Analysis:


An objective may be to develop a sentiment analysis system that can perform real-time analysis
of streaming data. This involves designing or utilizing techniques to handle large volumes of
data in real-time and make sentiment predictions in a timely manner.
Chapter 3
Literature Survey

There have been numerous studies and research papers on sentiment analysis, exploring various
aspects, techniques, and applications of the field. Here are a few notable research papers that have
contributed to the advancement of sentiment analysis:

3.1 "Opinion Mining and Sentiment Analysis" by Pang and Lee (2008): [1]
This seminal paper provides an overview of sentiment analysis, discusses challenges and techniques
in opinion mining, and introduces the use of machine learning algorithms for sentiment
classification.
The survey covers techniques and approaches that promise to directly enable opinion-oriented
information-seeking systems. Its focus is on methods that seek to address the new challenges raised
by sentiment-aware applications, as compared to those that are already present in more traditional
fact-based analysis. It includes material on summarization of evaluative text and on broader issues
regarding privacy, manipulation, and economic impact that the development of opinion-oriented
information-access services gives rise to. To facilitate future work, it also provides a discussion
of available resources, benchmark datasets, and evaluation campaigns.

Opinion-oriented extraction: Many applications, such as summarization or question answering,
require working with pieces of information that need to be pulled from one or more textual units. For
example, a multi-perspective question-answering (MPQA) system might need to respond to
opinion-oriented questions such as “Was the most recent presidential election in Zimbabwe regarded
as a fair election?”; the answer may be encoded in a particular sentence of a particular document, or
may need to be stitched together from pieces of evidence found in multiple documents. Information
extraction (IE) is precisely the field of natural language processing devoted to this type of task.
Hence, it is not surprising that the application of information-extraction techniques to opinion
mining and sentiment analysis has been proposed. The survey uses the term opinion-oriented
information extraction (opinion-oriented IE) to refer to information extraction problems particular to
sentiment analysis and opinion mining. (The phrase is sometimes shortened to opinion extraction,
which should not be construed narrowly as focusing on the extraction of opinion expressions; for
instance, determining product features is included under the umbrella of this term.) Past research in
this area has been dominated by work on two types of texts:

● Opinion-oriented information extraction from reviews has attracted a great deal of interest in
recent years. In fact, the term “opinion mining”, when construed in its narrow sense, has often been
used to describe work in this context. Reviews, while typically (but not always) devoted to a single
item, such as a product, service, or event, generally comment on multiple aspects, facets, or
features of that item, and all such commentary may be important. Extracting and analyzing opinions
associated with each individual aspect can help provide more informative summarizations or enable
more fine-grained opinion-oriented retrieval.

● Other work has focused on newswire. Unlike reviews, a news article is relatively likely to contain
descriptions of opinions that do not belong to the article’s author; an example is a quotation from
a political figure. This property of journalistic text makes the identification of opinion holders
(also known as opinion sources) and the correct association of opinion holders with opinions
important tasks, whereas for reviews, all expressed opinions are typically those of the author, so
opinion-holder identification is a less salient problem. Thus, when newswire articles are the focus,
the emphasis has tended to be on identifying expressions of opinions, the agent expressing each
opinion, and/or the type and strength of each opinion. Early work in this direction first carefully
developed and evaluated a low-level opinion annotation scheme, which facilitated the study of
sub-tasks such as identifying opinion holders and analyzing opinions at the phrase level.

3.2 "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank" by
Socher et al. (2013): [2]

This paper introduces the Recursive Neural Tensor Network (RNTN) model for sentiment
analysis, which captures compositional structure in sentences to improve sentiment prediction.

The paper discusses various compositional methods to combine words and phrases (n-grams) to
predict the binary (positive or negative) as well as fine-grained (very positive, positive, neutral,
negative, very negative) sentiments of words, phrases, and whole sentences in a bottom-up
fashion. The main contributions of the paper are a parse-tree-based dataset with
fine-grained sentiment labels, the “Stanford Sentiment Treebank”, and a neural
compositional model, the Recursive Neural Tensor Network (RNTN), which outperforms all
previous recursive models and achieves state-of-the-art performance.

The paper offered several important insights and observations. Models were compared with Naive
Bayes, SVMs, BiNB (NB with bigram features), and VecAvg (average of word vectors). On
fine-grained classification for all phrases (at all node levels of the parse trees), RNTN achieves
the best performance, followed by MV-RNN, RNN, and the other models. For binary classification at
the sentence level, RNTN pushes state-of-the-art accuracy from 80% to 85.4%.

Optimal performance for all the models was achieved with word vector dimensions between 25
and 35; performance deteriorates for smaller and larger dimensions. This indicates that the
improvement from RNTN does not come simply from a larger parameter count, since MV-RNN
has the largest number of parameters.

RNTN also captures the effect of negation in both positive and negative sentences. It has the highest
accuracy for negating positive sentences. It also increases the non-negative activation (the degree of
non-negative sentiment in a sentence) when negative sentences are negated, which
indicates that the model learns the concept of negation well beyond simple negation rules.

The RNTN model is powerful in capturing the structural composition of the words and phrases in a
sentence and in learning the effect of composition on sentiment in a principled and
efficient way. The treebank dataset captures intricacies of linguistic phenomena; all models show
substantial improvement in performance when trained on this new dataset. However, because
RNTN requires a parse tree of the input sentence, the model might not perform well on text
with poor grammatical construction, such as chatbot dialogues or tweets. Another interesting
experiment would be to observe the effect of pre-trained word embeddings such as word2vec,
GloVe, or fastText on the overall performance of the model, instead of learning the word
embeddings as parameters during training.
Chapter 4
Problem Description

Any business needs to understand its clients: their needs, their opinions, and their satisfaction
with the product. Large web-based companies need to analyse hundreds of
thousands or even millions of opinions about different products, and simply searching for
pre-defined “good” or “bad” words in the comments is not enough. With the rise of machine
learning, and deep neural networks in particular, sentiment analysis (the problem of
understanding the emotional tone of a text) can now be performed with high accuracy on many
tasks.

According to Wikipedia:
A basic task in sentiment analysis is classifying the polarity of a given text at the document,
sentence, or feature/aspect level — whether the expressed opinion in a document, a sentence or an
entity feature/aspect is positive, negative, or neutral. Advanced, “beyond polarity” sentiment
classification looks, for instance, at emotional states such as “angry”, “sad”, and “happy”.

4.1 Large Volume of Textual Data


With the advent of social media, online reviews, and digital communication platforms, there is an
overwhelming amount of textual data generated daily. Analyzing this vast volume of text manually
for sentiment becomes infeasible, necessitating automated methods like sentiment analysis.

4.2 Customer Feedback and Reviews


Companies receive a significant amount of customer feedback and reviews, which are crucial for
understanding customer sentiment and satisfaction levels. Sentiment analysis helps businesses
process and analyze these large volumes of feedback to gain insights into customer opinions,
preferences, and sentiments.

4.3 Brand Reputation Management


Organizations need to monitor and manage their brand reputation in the digital age. Sentiment
analysis enables them to track and analyze public sentiment towards their brand, products, or
services, identifying potential issues or negative sentiment that could impact their reputation.

4.4 Customer Support and Service


Sentiment analysis can be applied to customer support interactions, such as analyzing customer
service tickets, chat logs, or call center recordings. It helps in understanding customer sentiment
and satisfaction levels, enabling companies to provide personalized and effective customer service.

4.5 Public Opinion and Political Analysis


Sentiment analysis has become valuable in understanding public sentiment towards political
figures, policies, social issues, or public events. It aids policymakers and researchers in gauging
public opinion, tracking sentiment trends, and making informed decisions.

4.6 Market Research and Competitive Analysis


Sentiment analysis provides insights into customer sentiment towards existing products, potential
product launches, or competitor analysis. It helps companies identify market opportunities,
understand customer needs and preferences, and gain a competitive advantage.

4.7 Social Media Monitoring


Social media platforms have become an important source of public opinion and sentiment
expression. Sentiment analysis allows businesses and organizations to monitor social media
conversations, track sentiment towards their brand, products, or campaigns, and engage with
customers effectively.

4.8 Subjectivity and Ambiguity in Text


Textual data often contains subjective language and ambiguous expressions that impact sentiment
interpretation. Sentiment analysis algorithms and models aim to capture the nuances of sentiment,
including sarcasm, irony, or figurative language, to provide accurate sentiment classification.

4.9 Real-Time Analysis


With the need for timely insights, sentiment analysis systems are designed to perform sentiment
analysis in real-time. Real-time sentiment analysis enables prompt response to emerging issues,
crises, or sentiment shifts.

These problems and challenges have driven the development and advancement of sentiment
analysis techniques and technologies, aiming to automate the analysis of sentiment in text data,
derive valuable insights, and support decision-making processes.
Chapter 5
Proposed Work

5.1. Data Collection:


The system gathers textual data from various sources, such as social media, customer reviews,
news articles, or any text-based content relevant to the analysis.

5.2. Text Preprocessing:


The collected text data undergoes preprocessing to remove noise and irrelevant information. This
process involves steps like tokenization (splitting text into individual words or tokens), removing
stop words (common words that carry little meaning), and applying techniques such as stemming
or lemmatization to reduce words to their root forms.
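
The preprocessing steps above can be sketched as follows. This is a simplified illustration only: the stop-word set is a small hypothetical sample, and crude_stem is a naive suffix stripper standing in for a real stemmer or lemmatizer.

```python
import re

# Small hypothetical stop-word sample; real lists contain a few hundred words.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "and", "or",
              "to", "of", "in", "it", "this"}
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crude_stem(token):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


print(preprocess("The batteries are lasting longer than expected"))
```

For the sample sentence this yields ['batteri', 'last', 'longer', 'than', 'expect']. Stems such as "batteri" are not dictionary words, which is typical of stemming; lemmatization would return "battery" instead.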

5.3. Feature Extraction:


Relevant features or attributes are extracted from the preprocessed text data. These features could
include word frequencies, n-grams (sequences of adjacent words), parts of speech, or semantic
features.
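
As a minimal sketch of two of these features, word frequencies and n-grams, the code below counts unigrams and bigrams from a token list. The helper names ngrams and extract_features are assumptions for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all sequences of n adjacent tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_features(tokens, max_n=2):
    """Count every n-gram from length 1 up to max_n."""
    features = Counter()
    for n in range(1, max_n + 1):
        features.update(ngrams(tokens, n))
    return features


features = extract_features(["not", "good", "not", "bad"])
print(features["not"], features["not good"], features["not bad"])
```

Bigrams such as "not good" keep some of the word order that single-word counts lose, which helps with simple negation.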

5.4. Sentiment Classification:


Using the extracted features, the sentiment analysis system applies machine learning techniques
to classify the sentiment of the text. There are different approaches for sentiment classification,
including:

5.4.1 Rule-based approach:


This approach involves defining a set of rules or patterns that indicate positive, negative, or
neutral sentiment. The system checks the presence or absence of these patterns in the text to
determine sentiment.

5.4.2 Supervised machine learning approach:


In this approach, the system is trained on a labeled dataset where each text sample is associated
with a sentiment label (e.g., positive, negative, neutral). It learns patterns from the training data
and builds a model that can classify new, unseen text based on those patterns.
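
To make the supervised approach concrete, here is a small from-scratch sketch of a multinomial Naive Bayes classifier with add-one smoothing. It is an assumption-laden illustration, not the project's classifier: the class name TinyMultinomialNB and the four training samples are invented for this example.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial Naive Bayes over token counts with add-one smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        self.vocabulary = set()
        for tokens, label in zip(documents, labels):
            self.token_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            class_total = sum(self.token_counts[label].values())
            for token in tokens:
                if token not in self.vocabulary:
                    continue  # ignore words never seen in training
                count = self.token_counts[label][token]
                score += math.log((count + 1) / (class_total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training samples, already tokenized.
train_docs = [["great", "tablet"], ["love", "it"],
              ["too", "heavy"], ["slow", "and", "heavy"]]
train_labels = ["positive", "positive", "negative", "negative"]

model = TinyMultinomialNB().fit(train_docs, train_labels)
print(model.predict(["great", "love"]))
print(model.predict(["heavy"]))
```

The model learns which words are more frequent under each label and applies those patterns to new text. In practice a library implementation trained on a much larger labeled set would be used.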

5.4.3 Unsupervised machine learning approach:


Here, the system applies clustering or topic modeling techniques to group similar texts together.
The sentiment of these groups can then be inferred based on the sentiment of a few
representative samples.

5.5. Model Training and Evaluation:


The sentiment analysis system is trained on a labeled dataset where each text sample is associated
with its corresponding sentiment label. The performance of the trained model is evaluated using
evaluation metrics like accuracy, precision, recall, or F1 score to measure its effectiveness in
predicting sentiment accurately.

5.6. Sentiment Prediction:


Once the model is trained and evaluated, it can be used to predict the sentiment of new, unseen
text data. The system applies the trained model to the preprocessed text, assigns sentiment labels,
and provides an output indicating the sentiment polarity (positive, negative, or neutral) associated
with the text.

Overall, sentiment analysis aims to address the challenge of automatically understanding and
classifying sentiment in text data, considering factors such as context, subjectivity, ambiguity,
domain specificity, and real-time analysis requirements. Researchers and practitioners in the field
work on developing robust and accurate sentiment analysis models and systems to tackle these
challenges effectively.
Chapter 6
Implementation
6.1 Dataset
This dataset covers Amazon-branded/Amazon-manufactured products only, and customer
satisfaction with Amazon products seems to be the main focus. Using sentiment analysis, we can
predict scores for reviews based on certain words.
Potential suggestion for product reviews:
Product X is highly rated on the market; it seems most people like its lightweight, sleek design and fast
speeds. Most products associated with negative reviews seemed to be described as too
heavy to fit in a bag. Based on this data, we suggest that next-generation e-reader models be
lightweight and portable.

6.1.1 Assumptions:
● We assume that a sample size of 30K examples is sufficient to represent the entire population of
sales/reviews.
● We assume that the information in the text reviews of each product is rich enough to train a
sentiment analysis classifier with accuracy (hopefully) > 70%

6.2. Quick look at the raw Data


● We can potentially refine the sentiment analysis on the reviews.text column using the actual
value of the reviews.doRecommend column (boolean)
● We label each review with its sentiment
■ The title contains positive/negative information about reviews.
Table 6.1 Dataset used in the project
Table 6.2 Descriptive statistics of the dataset

Based on the descriptive statistics above, we see the following:

● The average review score is 4.58, with a low standard deviation
■ Most reviews are positive from the 2nd quartile onwards
● The mean of reviews.numHelpful (the number of people who found a review helpful) is 0.6, with a
high standard deviation
■ The values are widely spread around the mean; since a helpful count cannot be negative, the
spread lies entirely in the right tail
■ For most reviews, between 0 and 13 people found the review helpful (reviews.numHelpful).

Based on the information above:

● Drop reviews.userCity, reviews.userProvince, reviews.id, and reviews.didPurchase since these
values are floats (for exploratory analysis only)
● Not every column is fully populated relative to the total number of rows
● The reviews.text column has the least missing data (34659 of 34660 values present) -> Good news!
● We need to clean up the name column by referencing asins (unique products) since we have 7000
missing values

Visualizing the distributions of numerical variables:

Fig. 6.3 Distributions of numerical variables


Based on the distributions above:

● reviews.numHelpful: Outliers are valuable in this case, so we want to give more weight to
reviews that more than 50 people found helpful
● reviews.rating: The majority of examples were rated highly (looking at the rating distribution).
There are twice as many 5-star ratings as all other ratings combined.

6.3 Split into Train/Test


● The goal is to eventually train a sentiment analysis classifier.
● Since the majority of reviews are positive (5 stars), we do a stratified split on the review score
so that the training and test sets keep the same rating proportions as the full dataset.
● To use sklearn's StratifiedShuffleSplit class, we remove all samples that have NaN
in the review score, then convert all review scores to the integer datatype.
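The split described above can be sketched as follows. The toy DataFrame, the 80/20 ratio and the random_state are assumptions made for illustration; the report does not state which values were used. The names strat_train and strat_test follow the code shown later in the report.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy stand-in for the review data: ratings skewed towards 5 stars,
# plus a few rows with a missing rating.
df = pd.DataFrame({
    "reviews.text": [f"review {i}" for i in range(105)],
    "reviews.rating": [5.0] * 70 + [4.0] * 20 + [1.0] * 10 + [np.nan] * 5,
})

# Remove samples without a rating, then cast the ratings to integers.
data = df.dropna(subset=["reviews.rating"]).copy()
data["reviews.rating"] = data["reviews.rating"].astype(int)

# One stratified split on the rating column (80/20 is an assumed ratio).
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(data, data["reviews.rating"]))
strat_train = data.iloc[train_idx]
strat_test = data.iloc[test_idx]

# Both parts should show (almost) the same rating proportions.
print(strat_train["reviews.rating"].value_counts(normalize=True))
print(strat_test["reviews.rating"].value_counts(normalize=True))
```

Note that stratification does not remove the class imbalance. It only guarantees that the imbalance is the same in the training and test sets.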

6.4 Data Exploration (Training Set)


Next, we explore the following columns:

● asins
● name
● reviews.rating
● reviews.doRecommend
● (reviews.numHelpful - not explored, since numHelpful is mostly between 0-13 as per the previous
analysis in Raw Data)
● (reviews.text - not explored, since it is free-form text)

We also explore these columns in relation to asins

Working hypothesis: there are only 35 products based on the training data ASINs

● One for each ASIN, but more product names (47)
● ASINs are what's important here since we're concerned with products. There is a one-to-many
relationship between ASINs and names
● A single ASIN can have many names due to different vendor listings
● There could also be a lot of missing names, or more unique names with slight variations in title
(e.g. 8gb vs 8 gb, NaN for product names)
This confirms the hypothesis that each ASIN can have multiple names. Therefore we should only really
concern ourselves with which ASINs do well, not with the product names.
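A quick way to check the one-to-many relationship between ASINs and names is to count the distinct names per ASIN. The sketch below uses a made-up table; the ASIN codes and product names are hypothetical.

```python
import pandas as pd

# Hypothetical reviews: one ASIN listed under slightly different names.
reviews = pd.DataFrame({
    "asins": ["B01", "B01", "B01", "B02", "B02"],
    "name": ["Tablet 8gb", "Tablet 8 gb", None, "Speaker", "Speaker"],
})

# Number of unique products versus number of unique product names.
n_asins = reviews["asins"].nunique()
n_names = reviews["name"].nunique()

# Distinct (non-missing) names attached to each ASIN.
names_per_asin = reviews.groupby("asins")["name"].nunique()

print(n_asins, n_names)
print(names_per_asin)
```

Any ASIN with a count above 1 has several listings, which is why the analysis is keyed on asins rather than on name.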

Fig. 6.4.1 Bar graph showing product sales frequency based on ASINS

● Based on the bar graph for ASINs, certain products have significantly more reviews than other
products, which may indicate higher sales of those specific products
● The ASIN frequencies have a "right tailed" distribution, which can also suggest that certain
products have higher sales, reflected in the higher ASIN frequencies in the reviews.
● We take the log of the ASIN frequencies to normalize the data and display a more detailed picture
of each ASIN, and we see that the distribution still follows a "right tailed" distribution
This answers the first question: certain ASINs (products) have better sales while other ASINs have
lower sales, which in turn informs which products should be kept or dropped.
6.4.2 reviews.rating / ASINs

Fig. 6.4.2 Average rating of each ASINS

● 1a) The most frequently reviewed products have their average review ratings in the 4.5 - 4.8
range, with little variance
● 1b) Although there is a slight inverse relationship between the ASIN frequency level and the
average review ratings for the first 4 ASINs, this relationship is not significant since the
average reviews for the first 4 ASINs are rated between 4.5 - 4.8, which is considered good
overall
● 2a) For ASINs with lower frequencies as shown on the bar graph (top), we see that their
corresponding average review ratings on the point-plot graph (bottom) have significantly higher
variance, as shown by the length of the vertical lines. As a result, we suggest that the average
review ratings for ASINs with lower frequencies are not significant for our analysis due to
high variance
● 2b) On the other hand, the lower frequencies of these ASINs may themselves be a result of
lower quality products
● 2c) Furthermore, the last 4 ASINs have no variance due to their significantly lower
frequencies. Although their review ratings are a perfect 5.0, we should not consider these
review ratings significant because of the low frequency, as explained in 2a)
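The reasoning in 1a) to 2c) rests on comparing the review count, mean and spread of the ratings for each ASIN. A minimal sketch of that comparison, using made-up ratings and hypothetical ASIN codes:

```python
import pandas as pd

# Hypothetical training ratings: one popular ASIN and two rare ones.
train = pd.DataFrame({
    "asins": ["B01"] * 6 + ["B02"] * 2 + ["B03"],
    "reviews.rating": [5, 5, 4, 5, 4, 5, 5, 1, 5],
})

# Review count, mean rating and standard deviation per ASIN,
# ordered from the most to the least frequently reviewed product.
summary = (
    train.groupby("asins")["reviews.rating"]
    .agg(["count", "mean", "std"])
    .sort_values("count", ascending=False)
)
print(summary)
```

In this toy table the rarely reviewed ASINs either swing widely (two reviews, ratings 5 and 1) or show a perfect 5.0 from a single review, for which no spread can be estimated. Both cases illustrate why means computed on few reviews are treated as unreliable.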

Fig. 6.4.3 Average doRecommend rating of each ASINS


● From this analysis, we can see that consumers recommend the products of the first 19 ASINs,
which is consistent with the "reviews.rating / ASINs" analysis above, where the first 19
ASINs have good ratings between 4.0 and 5.0
● The remaining ASINs have fluctuating results due to their lower sample size, so they should not
be considered

6.5 Correlations
Table 6.5.1 Correlations for each attribute

Fig. 6.5.2 Dot graph of ratings with sale of ASIN

From our data exploration of ASINs and reviews.rating above, we discovered that there are many ASINs
with low occurrence that have high variance. As a result, we concluded that these low occurrence
ASINs are not significant in our analysis given their low sample size.
Similarly, in our correlation analysis between ASINs and reviews.rating, we see that there is almost no
correlation, which is consistent with our findings.

6.6 Sentiment Analysis


Using the features in place, we will build a classifier that can determine a review's sentiment.

6.6.1 Set Target Variable (Sentiments)


Segregate ratings from 1-5 into positive, neutral, and negative.

Test code:

def sentiments(rating):
    # Map a 1-5 star rating to a sentiment label
    if (rating == 5) or (rating == 4):
        return "Positive"
    elif rating == 3:
        return "Neutral"
    elif (rating == 2) or (rating == 1):
        return "Negative"

# Add sentiments to the data
strat_train["Sentiment"] = strat_train["reviews.rating"].apply(sentiments)
strat_test["Sentiment"] = strat_test["reviews.rating"].apply(sentiments)
strat_train["Sentiment"][:20]

6.6.2 Extract Features
Here we will turn content into numerical feature vectors using the Bag of Words strategy:

6.7 Bag of Words Model


The Bag of Words model is a Natural Language Processing technique for text modeling.
Algorithms in NLP work on numbers, so we cannot feed our text directly
into them. Hence, the Bag of Words model is used to preprocess the text by converting it
into a bag of words, which keeps a count of the occurrences of each word (often restricted to the
most frequently used words). This model can be visualized as a table that lists each word
alongside its count.

The Bag of Words strategy works as follows:

- Assign a fixed integer id to each word occurring in the training set (i.e. build a dictionary
from words to integer indices).
- For each document i, count the occurrences of each word and store the count in X[i,j], where j
is the integer index of the word in the dictionary and X is the array of word counts for
our training set.

In order to implement the Bag of Words strategy, we will use SciKit-Learn's CountVectorizer to
perform the following:

* Text preprocessing:
  * Tokenization (breaking sentences into words)
  * Stopwords (filtering "the", "are", etc)
* Occurrence counting (builds a dictionary of features from integer indices with word
occurrences)
* Feature Vector (converts the dictionary of text documents into a feature vector)
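The steps above can be sketched with CountVectorizer on two made-up sentences. CountVectorizer does not filter stop words unless asked to, so passing stop_words="english" here is an assumption about the configuration, not something the report states.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical reviews.
docs = [
    "The tablet is great and the screen is great",
    "The battery is bad",
]

# Tokenize, drop English stop words, and count word occurrences.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Dictionary from each word to its integer column index j.
vocab = vectorizer.vocabulary_
print(sorted(vocab))

# X[i, j] is the count of word j in document i.
print(X.toarray())
print(X[0, vocab["great"]])
```

X is returned as a sparse matrix, which matters at the scale reported for the training set because most words do not occur in most reviews.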

6.8 TFIDF

Here we have 27,701 training samples and 12,526 distinct words in our training sample.
With longer documents, we typically see higher average count values even for words that carry
very little meaning. This overshadows shorter documents, which have lower average counts for
the same relative frequencies. As a result, we will use TfidfTransformer to reduce this effect:

Term Frequencies (Tf) divides the number of occurrences of each word in a document by the total
number of words in that document.

Term Frequencies times Inverse Document Frequency (Tfidf) downscales the weights of words that
occur in many documents (assigns less value to unimportant stop words such as "the", "are").
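The effect of the Tf-idf weighting can be seen on a tiny made-up corpus in which the word "good" appears in every document. This is a sketch using scikit-learn's default settings (smoothed idf, L2 row normalization), which are assumed here rather than taken from the report.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical corpus: "good" occurs in all three documents.
docs = ["good tablet", "good speaker", "good good battery"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)
vocab = vectorizer.vocabulary_

# Re-weight the raw counts with Tf-idf.
tfidf = TfidfTransformer(use_idf=True)
X_tfidf = tfidf.fit_transform(counts)

# A word found in every document gets the smallest idf weight.
idf_good = tfidf.idf_[vocab["good"]]
idf_tablet = tfidf.idf_[vocab["tablet"]]
print(idf_good, idf_tablet)

# Each document vector is scaled to unit length, so long and
# short documents become comparable.
row_norms = np.linalg.norm(X_tfidf.toarray(), axis=1)
print(row_norms)
```

In the first document the rarer word "tablet" ends up with a larger weight than "good", even though both occur once.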

6.9 We will use Multinomial Naive Bayes as our Classifier

Naive Bayes is a widely used algorithm for text data analysis and for problems with multiple
classes. Bayes' theorem, formulated by Thomas Bayes, calculates the probability of an event occurring
based on prior knowledge of conditions related to the event. It is based on the following formula:
P(A|B) = P(A) * P(B|A) / P(B)    (eq. 1.1)

Here we are calculating the probability of class A when predictor B is already provided.

● Multinomial Naive Bayes is most suitable for word counts, where data are typically represented
as word count vectors (X[i,j] is the number of times outcome j is observed over the n trials),
while also ignoring non-occurrences of a feature
● Naive Bayes is a simplified application of Bayes' theorem, where all features are assumed to be
conditionally independent of each other given the class: P(x|y), where x is the feature and y is
the class.
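A minimal sketch of how the pieces described so far (word counts, Tf-idf weighting, Multinomial Naive Bayes) can be chained into one classifier. The pipeline name clf_multiNB_pipe matches the variable used in the test code of this report, but the step names, the training sentences and the default parameters are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical training reviews with their sentiment labels.
train_texts = [
    "great tablet love it",
    "excellent sound great price",
    "terrible battery broke fast",
    "awful screen waste of money",
]
train_labels = ["Positive", "Positive", "Negative", "Negative"]

# Chain the feature extraction steps and the classifier.
clf_multiNB_pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf_nominalNB", MultinomialNB()),
])
clf_multiNB_pipe.fit(train_texts, train_labels)

# Classify unseen text.
predicted = clf_multiNB_pipe.predict(["great sound", "terrible screen"])
print(list(predicted))
```

Wrapping the steps in a Pipeline means raw text can be passed straight to fit and predict, and the same vocabulary and idf weights learned on the training set are reused at prediction time.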

6.10 Test Model


Test code:

import numpy as np
# Predict the sentiment of the test reviews and compare with the true labels
predictedMultiNB = clf_multiNB_pipe.predict(X_test)
np.mean(predictedMultiNB == X_test_targetSentiment)

Output: 0.9344498989315623

Here we see that our Multinomial Naive Bayes Classifier has a 93.45% accuracy level based on
the features.
Fig 6.6 Code for classifier

6.11 Fine tuning the Support Vector Machine Classifier

● Here we run a grid search for the best parameters over a grid of possible values, instead of
manually tweaking the parameters of the various components of the chain (e.g. use_idf in the tfidf
transformer)
● We run the grid search with the LinearSVC classifier pipeline, the parameter grid, and CPU core
maximization
● Then we fit the grid search to our training data set
● Next we use our final classifier (after fine-tuning) to test some arbitrary reviews
● Finally we test the accuracy of our final classifier (after fine-tuning).

Note that Support Vector Machines are well suited to classification: they place the decision boundary
using the samples closest to it (the support vectors) and maximize the margin between classes, so
that even the hardest cases are separated as well as possible and reviews can be classified as
Positive, Neutral or Negative.
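The grid search procedure outlined above can be sketched as follows. The toy reviews, the pipeline step names, the two-value grids and cv=2 are all assumptions chosen so that the example runs quickly; n_jobs=1 is used here for portability, whereas n_jobs=-1 would use all CPU cores.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical reviews; four per class so that 2-fold CV can stratify.
texts = [
    "great tablet love it", "excellent sound great price",
    "love the screen", "great value excellent battery",
    "it is okay", "average tablet nothing special",
    "okay sound average price", "nothing special but okay",
    "terrible battery broke fast", "awful screen waste of money",
    "broke after a week terrible", "awful sound waste",
]
labels = ["Positive"] * 4 + ["Neutral"] * 4 + ["Negative"] * 4

clf_linearSVC_pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LinearSVC()),
])

# Grid of candidate values, addressed as <step name>__<parameter>.
parameters = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": [True, False],
}

gs_clf = GridSearchCV(clf_linearSVC_pipe, parameters, cv=2, n_jobs=1)
gs_clf.fit(texts, labels)

# Best cross-validated score, best parameters and the refitted model.
print(gs_clf.best_score_)
print(gs_clf.best_params_)
print(gs_clf.best_estimator_)

# The fitted search object can be used directly as the final classifier.
new_text = ["great sound, love it"]
print(gs_clf.predict(new_text))
```

Because GridSearchCV refits the best parameter combination on the whole training set by default, calling predict on the search object uses the tuned pipeline.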
Chapter 7

Result and Analysis

7.1 Analysis of the model


For detailed analysis, we:

● Analyze the best mean score of the grid search (classifier, parameters, CPU core)
● Analyze the best estimator
● Analyze the best parameters
● Here the best mean score of the grid search is 93.65%, which is very close to our accuracy
level of 94.08%
● The best estimator is also displayed
● Lastly, the best parameters are use_idf set to true in tfidf and an ngram_range of (1, 2)

7.2 Summary of the classification report


Below is the summary of the classification report:

● Precision: determines how many of the objects selected were correct
● Recall: tells you how many of the objects that should have been selected were actually
selected
● F1 score is the harmonic mean of precision and recall, so both are weighted equally
(1 is the best possible score, 0 the worst)
● Support is the number of occurrences of each class.
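To make these four quantities concrete, the sketch below computes them per class on ten made-up predictions that are skewed towards Positive, similar in spirit to the dataset. The labels are hypothetical and are not the project's results.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

classes = ["Positive", "Neutral", "Negative"]

# Hypothetical ground truth and predictions: one Neutral and one
# Negative review are wrongly predicted as Positive.
y_true = ["Positive"] * 6 + ["Neutral"] * 2 + ["Negative"] * 2
y_pred = ["Positive"] * 6 + ["Positive", "Neutral"] + ["Positive", "Negative"]

precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=classes
)
accuracy = accuracy_score(y_true, y_pred)

for name, p, r, f, s in zip(classes, precision, recall, f1, support):
    print(f"{name:9s} precision={p:.2f} recall={r:.2f} f1={f:.2f} support={s}")
print(f"accuracy={accuracy:.2f}")
```

Here the overall accuracy is 0.80 while the recall of the Neutral and Negative classes is only 0.50, which shows how a high accuracy on skewed data can hide weak performance on the minority classes.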

Results:

● After testing some arbitrary reviews, it seems that our features are performing correctly,
returning Positive, Neutral and Negative results
● We also see that after running the grid search, our Support Vector Machine Classifier has
improved to a 94.08% accuracy level

The results of this analysis confirm the previous data exploration analysis, where the data are very
skewed towards positive reviews, as shown by the lower support counts for the other classes in the
classification report. Also, both neutral and negative reviews have a large standard deviation and
small frequencies, so we would not consider them significant, as shown by their lower precision,
recall and F1 scores in the classification report.

However, although Neutral and Negative results are not very strong predictors in this data set,
the classifier still shows a 94.08% accuracy level in predicting sentiment, and it worked very well
when we tested it on arbitrary text (new_text). Therefore, we are comfortable here
with the skewed data set. Also, as we continue to add new data in the future that is more
balanced, retraining the model should yield a more balanced classifier, which we expect to increase
the accuracy level.

Finally, the overall result here explains that the products in this dataset are generally positively
rated.
Considering only rows 2-4 and columns 2-4 of the confusion matrix, labeled negative, neutral and
positive, we see that positive sentiment is sometimes confused with neutral and negative ratings,
with counts of 246 and 104 respectively. However, compared with the count of 6445 for positive
sentiment, the confusion counts of 246 and 104 for neutral and negative ratings are considered
insignificant.
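A confusion matrix of this kind can be produced as sketched below. The ten labels are made up, so the counts are not those of the project. In scikit-learn's convention, rows are the true classes and columns are the predicted classes.

```python
from sklearn.metrics import confusion_matrix

classes = ["Negative", "Neutral", "Positive"]

# Hypothetical ground truth and predictions: one Neutral and one
# Negative review are wrongly predicted as Positive.
y_true = ["Positive"] * 6 + ["Neutral"] * 2 + ["Negative"] * 2
y_pred = ["Positive"] * 6 + ["Positive", "Neutral"] + ["Positive", "Negative"]

# Rows: true class, columns: predicted class, both in `classes` order.
cm = confusion_matrix(y_true, y_pred, labels=classes)
print(cm)

# Share of all reviews that fall off the diagonal (misclassified).
error_rate = (cm.sum() - cm.trace()) / cm.sum()
print(error_rate)
```

Reading down the Positive column shows how many Neutral and Negative reviews were absorbed into the dominant class, which is the pattern a positively skewed dataset tends to produce.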

Also, this is a result of a positively skewed dataset, which is consistent with both the data
exploration and the sentiment analysis. Therefore, we conclude that the products in this dataset are
generally positively rated, and should be kept in Amazon's product roster.

Results
The results of a sentiment analysis project can be evaluated based on several metrics and factors:

1. Accuracy: Accuracy is the most commonly used metric to measure the performance of a
sentiment analysis model. It indicates the proportion of correctly classified instances out of the
total number of instances. Higher accuracy indicates better performance.

2. Precision and Recall: Precision measures the proportion of true positive predictions (correctly
identified positive sentiments) out of all positive predictions made by the model. Recall, on the
other hand, measures the proportion of true positive predictions out of all actual positive
instances in the data. Both precision and recall are important in sentiment analysis as they provide
insights into the model's ability to correctly identify positive or negative sentiments.

3. F1-Score: The F1-score is the harmonic mean of precision and recall, providing a balanced
measure of the model's performance. It takes into account both precision and recall, making it a
useful metric when classes are imbalanced.

4. Confusion Matrix: A confusion matrix provides a detailed breakdown of the model's
predictions, showing the number of instances classified into each sentiment category (positive,
negative, neutral) and how they align with the ground truth labels. It helps to identify false
positives and false negatives, and to assess the overall performance across different sentiment
categories.

5. Error Analysis: Analyzing the misclassified instances can provide insights into the limitations
and challenges faced by the sentiment analysis model. Understanding the types of errors made
(e.g., misinterpreting sarcasm, handling negations, or context-specific sentiments) can guide
further improvements in the model or the preprocessing techniques.

6. Domain-specific Evaluation: Depending on the application domain of the sentiment analysis
project, additional evaluation metrics or criteria specific to that domain may be considered. For
example, in financial sentiment analysis, the accuracy of predicting stock market movements
based on sentiment can be an important evaluation metric.

It's important to note that the performance and results of a sentiment analysis project can vary
based on the quality and size of the training data, the choice of algorithms and models, and the
preprocessing techniques employed. Continuous monitoring and evaluation are necessary to
ensure the model's performance remains consistent and reliable.
Chapter 8

Conclusion and Future Work


From the analysis above in the classification report, the lower rated products do not have enough
reviews for us to conclude that they are inferior. On the other hand, products
that are highly rated are considered superior products, which also perform well and should
continue to sell at a high level.

As a result, we need to input more data in order to assess the significance of lower rated
products and to determine which products should be dropped from Amazon's product roster.

In conclusion, although we need more data to balance out the lower rated products and to assess
their significance, we were still able to successfully associate positive, neutral and negative
sentiments with each product in Amazon's Catalog.

Conclusion
In conclusion, sentiment analysis is a valuable technique for automatically analyzing and
understanding the sentiments expressed in text data. The success of a sentiment analysis project
relies on data collection, preprocessing, model selection, training, and evaluation. By considering
metrics such as accuracy, precision, recall, F1-score, and analyzing the confusion matrix, we can
assess the performance of the sentiment analysis model.

Future Work

Future work in sentiment analysis can focus on the following areas:

1. Fine-tuning and Transfer Learning: Leveraging pre-trained language models such as
BERT, GPT, or RoBERTa and fine-tuning them on specific sentiment analysis tasks can
potentially improve the performance of sentiment analysis models. Transfer learning
techniques can help capture semantic nuances and contextual information that may not be
captured effectively by traditional models.

2. Handling Context and Domain-Specific Sentiments: Sentiment analysis models often
struggle with understanding sentiments in specific contexts or domains. Future work can
explore techniques to handle domain-specific sentiments and context-aware sentiment
analysis. This could involve incorporating domain-specific knowledge, developing
specialized models for specific domains, or adapting existing models to different contexts.

3. Multimodal Sentiment Analysis: Sentiments can be expressed through various modalities
such as text, images, audio, or video. Future research can explore techniques to combine
and analyze multiple modalities to gain a more comprehensive understanding of
sentiments. This would involve developing models that can effectively integrate and
process multimodal data.

4. Emotion Analysis: Sentiment analysis primarily focuses on classifying text into positive,
negative, or neutral sentiments. However, emotions play a crucial role in understanding
human sentiments. Future work can involve developing models that can detect and
classify specific emotions such as joy, sadness, anger, or surprise.

5. Handling Sarcasm, Irony, and Figurative Language: Sentiment analysis models often
struggle with identifying sarcasm, irony, or sentiments expressed through figurative
language. Future research can explore techniques to better handle these linguistic
phenomena to improve the accuracy of sentiment analysis models.

6. Real-Time and Dynamic Sentiment Analysis: As sentiments change over time and in
response to events, real-time and dynamic sentiment analysis becomes crucial. Future
work can focus on developing models and algorithms that can analyze sentiments in
real-time, capture sentiment shifts, and adapt to changing sentiment patterns.

In summary, future work in sentiment analysis should aim to address the challenges of
domain-specific sentiments, context-aware analysis, multimodal data, emotions, linguistic
nuances, and real-time analysis to further enhance the accuracy and applicability of
sentiment analysis models in various domains and applications.