Sentiment Analysis of Reviews Using Machine Learning
MACHINE LEARNING
A MINI PROJECT REPORT
BACHELOR OF ENGINEERING
In
Batch 1
Our project focuses on sentiment analysis of customers' reviews on one of the
most popular kinds of e-commerce platform, women's clothing shopping sites,
using machine learning.
Machine learning helps to improve the shopping experience by taking personal
preferences into account and recommending products to consumers when they make
a new purchase, based on their purchase history. Sentiment analysis allows
e-commerce platforms to understand the opinions expressed in customer
feedback. Along with identifying the emotions in customer feedback, it also
analyzes the opinions to find the reasons behind them.
The dataset includes the attributes Clothing ID, Age, Title, Review Text, Rating,
Recommended IND, Positive Feedback Count, Division Name, Department
Name, and Class Name.
We attempt to understand the correlation between different variables in customer
reviews on a women's clothing e-commerce site, and to classify each review by the
deeper meaning of its words. These words further help us to predict whether
the reviewed product is recommended or not and whether the review carries positive,
negative, or neutral sentiment. To achieve these goals, we employed the Multinomial
Naive Bayes algorithm.
To understand the dataset, we use data representation techniques such as bar
graphs showing the number of reviews by age and by category. We also create a confusion
matrix to check the effectiveness of our classifier.
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1 INTRODUCTION
1.1 AIM
1.2 OBJECTIVE
2 THEORETICAL BASIS
3 SYSTEM ANALYSIS
4 WORKFLOW
4.1 PRE-PROCESSING
4.2 POLARITY
5 METHODOLOGY AND IMPLEMENTATION
6 RESULT
6.3.3 PERFORMANCE MEASUREMENT OF RECOMMENDED IND
6.3.4 PERFORMANCE MEASUREMENT OF RATING
7.1 CONCLUSION
REFERENCES
Fig no Description
4.1 Pre-processing
4.2 Polarizing
5.2 Sentiment.polarity
Table no Description
5.1 Dataset
INTRODUCTION
Online shopping portals use the reviews as a tool for understanding their customers, in order to
further improve their products and/or services. Text analysis has become an active field of research
in computational linguistics and natural language processing. One of the most popular problems in
the mentioned field is text classification, a task which attempts to categorize documents to one or
Public opinion plays a vital role in helping business organizations market their products, pursue new
opportunities, and predict sales. This is achieved by searching for suitable information in the
accumulated data pertaining to the user's history of visiting particular places. A large amount of
data can be analyzed, and opinions can be predicted, using a sentiment analysis technique.
Consumers rely on online reviews for first-hand information when making purchase decisions. However, the
sheer number of reviews for a single product makes it almost impossible to read them all and
judge the quality of the product. People nowadays pay more attention to their appearance and
comfort, especially where clothes are concerned, choosing different clothes depending on whom
they will be meeting and avoiding wearing the same clothes two days in a row.
This approach is widely known as sentiment analysis that uses statistics and natural language
Our aim is to analyze and classify the reviews present in the dataset based on category of clothes
and age group and hence provide the recommendations and rating based on the customer’s new
1.2 OBJECTIVE
To focus on e-commerce reviews on women clothing shopping sites, where our aim is to
To help the retailers, the e-commerce platforms will use the summary of customer feedback
To help the existing and prospective customers in deciding the products of their interests.
Sentiment analysis for the given e-commerce dataset on women’s clothing and predict the
To provide correct Recommended IND and rating values based on the review text.
When choosing a particular garment, the customer at present has to go through the reviews presented
to them. With just a few reviews, this is most effectively done by simply
reading the comments. However, with thousands or even hundreds of
thousands of reviews arriving on a regular basis, reading all the feedback is difficult if not
impossible. Sentiment analysis can be very useful here because it can help businesses to
differentiate themselves in certain ways and therefore stand out from the crowd of competitors to
We analyse the whole dataset and build a classifier using a machine learning technique,
the Multinomial Naïve Bayes algorithm. We also analyse the whole dataset to support the
correctness of our classifier. Thus, sentiment analysis can be of great help in providing directional
insight and in paving the way for further analysis and appropriate action.
SYSTEM ANALYSIS
• Python 3
• Jupyter Notebook
• NLTK Library
• Matplotlib Library
• Tkinter library
WORKFLOW
4.1 PRE-PROCESSING
First, we remove the unwanted columns from our dataset. We then send the review text columns
for pre-processing.
Next, we remove unnecessary data such as non-alphanumeric characters, hyperlinks, and stop words
from the review text and convert all characters to lowercase in order to achieve consistency.
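The cleaning step described above can be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual code: the function name clean_review, the two regular expressions, and the very short stop-word list are assumptions made for the example (a real run would more likely use a fuller list such as the one shipped with NLTK).

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration;
# a real run would use a fuller list such as NLTK's English stop words.
STOP_WORDS = {"a", "an", "and", "are", "at", "because", "is", "it", "the", "this", "to"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
NON_ALNUM_PATTERN = re.compile(r"[^a-z0-9\s]")


def clean_review(text):
    """Lowercase a review, strip hyperlinks and non-alphanumeric characters, drop stop words."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)        # remove hyperlinks first, while they are still intact
    text = NON_ALNUM_PATTERN.sub(" ", text)  # keep only letters, digits and whitespace
    tokens = [word for word in text.split() if word not in STOP_WORDS]
    return " ".join(tokens)


print(clean_review("This dress is BEAUTIFUL!!! See https://example.com because it fits."))
```

Hyperlinks are removed before punctuation so that a URL is dropped as a whole rather than being broken into stray word fragments.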
The pre-processed data is polarized to classify each review as expressing positive, negative, or
neutral emotion.
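As a rough sketch of the polarizing step, the function below maps a polarity score in the range -1 to 1 to one of the three labels. The function name polarity_label is hypothetical, and treating a score of exactly 0 as neutral is an assumption of this sketch rather than something the report states.

```python
def polarity_label(score):
    """Map a polarity score in [-1, 1] to a sentiment label."""
    if not -1.0 <= score <= 1.0:
        raise ValueError("polarity score must lie in [-1, 1]")
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"  # assumption: a score of exactly 0 counts as neutral


# With TextBlob installed, the score could be obtained as:
#   from textblob import TextBlob
#   score = TextBlob("love this dress").sentiment.polarity
for score in (0.8, 0.0, -0.35):
    print(score, polarity_label(score))
```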
Now, the analyzer takes the best positive emotions (with value 1) along with user interests to
• In this approach, each element in the vector corresponds to a unique word (token) in the
corpus vocabulary. Then, if the token at a particular index exists in the document, that element
• These words are then sent into a splitting module where you define a splitting index to split
• This training data is provided to the Multinomial naïve Bayes algorithm by which we get
• We then send the test data to the classifier, which applies the learned function and returns the
desired result.
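The vector representation and the splitting step in the list above can be sketched with the standard library alone. The helper names build_vocabulary, vectorize, and split_rows are hypothetical, as are the five toy reviews; the 0.2 test fraction follows the 80:20 ratio used in this report. The project's own implementation may well rely on library routines instead.

```python
import random
from collections import Counter


def build_vocabulary(documents):
    """Assign each unique token a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def vectorize(document, vocabulary):
    """Return a count vector: element i is how often vocabulary word i occurs."""
    vector = [0] * len(vocabulary)
    for token, count in Counter(document.split()).items():
        if token in vocabulary:  # words missing from the vocabulary are ignored
            vector[vocabulary[token]] = count
    return vector


def split_rows(rows, labels, test_fraction=0.2, seed=0):
    """Shuffle, then cut rows and labels into train and test parts at a splitting index."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    split_index = len(rows) - int(round(len(rows) * test_fraction))
    train_idx, test_idx = order[:split_index], order[split_index:]
    return (
        [rows[i] for i in train_idx],
        [rows[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )


# Toy reviews and labels, made up for illustration (1 = recommended, 0 = not).
docs = ["love dress love fit", "poor fabric", "love fabric", "poor fit", "great dress"]
labels = [1, 0, 1, 0, 1]

vocab = build_vocabulary(docs)
matrix = [vectorize(doc, vocab) for doc in docs]
x_train, x_test, y_train, y_test = split_rows(matrix, labels)
print(vocab)
print(len(x_train), "training rows,", len(x_test), "test rows")
```

Shuffling with a fixed seed keeps the split reproducible between runs, which makes later accuracy figures comparable.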
The dataset used is Women’s Clothing E-Commerce dataset revolving around the reviews
written by customers.
Its nine supportive features offer a great environment to parse out the text through its
multiple dimensions.
The attributes include independent attributes such as Clothing ID, Age, Department Name, and
Title.
Dependent attributes such as Division Name and Class Name depend on Department Name and
Clothing ID. Review Text, Rating, and Positive Feedback Count depend on Age and Title.
• We first extract the main attributes from the given dataset which includes Clothing ID,
• Since we have to pre-process and clean the text, we extract the Review Text attribute
column.
Use the replace() function along with the re library, which provides regular expressions, to
• Remove stop words like and, are, because, at etc. from review text by comparing it with the
• Tokenize the text by separating it into individual words and tagging each with its part of speech (example:
the word “Beautiful” is an adjective, tagged JJ) in order to convert the words to vector form.
5.3 COMPUTING POLARITY
Pre-processed data is sent to the TextBlob library, which has a sentiment module which has
and rates the words in the range of -1 to 1, where -1 to 0 indicates negative words and 0 to 1
As our objective is to provide the customer with the clothing IDs that are most highly recommended
according to the reviews, we consider the clothing IDs which have the highest possible
Based on the user's interests, such as age group or category of clothes, the analyser suggests the
clothes.
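One way to picture the suggestion step is the sketch below, which ranks clothing IDs by the mean polarity of their reviews after filtering on the user's interests. It is only an assumption about how such a ranking could work: the function name recommend, the dictionary keys, the use of the mean as the ranking score, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def recommend(reviews, category=None, age_range=None, top_n=3):
    """Rank clothing IDs by mean review polarity, optionally filtered by category and age."""
    scores = defaultdict(list)
    for review in reviews:
        if category is not None and review["class_name"] != category:
            continue
        if age_range is not None and not age_range[0] <= review["age"] <= age_range[1]:
            continue
        scores[review["clothing_id"]].append(review["polarity"])
    # Highest mean polarity first; ties are broken by the smaller clothing ID.
    ranked = sorted(scores, key=lambda cid: (-sum(scores[cid]) / len(scores[cid]), cid))
    return ranked[:top_n]


# Made-up sample rows for illustration only.
sample_reviews = [
    {"clothing_id": 101, "age": 34, "class_name": "Dresses", "polarity": 0.9},
    {"clothing_id": 101, "age": 29, "class_name": "Dresses", "polarity": 0.5},
    {"clothing_id": 102, "age": 41, "class_name": "Knits", "polarity": 0.8},
    {"clothing_id": 103, "age": 31, "class_name": "Dresses", "polarity": 0.2},
    {"clothing_id": 103, "age": 62, "class_name": "Dresses", "polarity": 1.0},
]

print(recommend(sample_reviews, category="Dresses", age_range=(25, 40)))
print(recommend(sample_reviews))
```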
Build a vocabulary of all the unique words in our dataset, and associate a unique
Now we compare this vocabulary of words with our filtered dataset and fit the
filtered words with the corresponding values based on importance (weightage) present in
• Next the transform function creates a matrix of size 23,486 review text X number of words
corresponding position in sparse matrix. At each index in this list, we mark how many times
the given word appears in our sentence. This process is done using fit_transform function.
• For visualization purposes, we use the get_feature_names function to find the word that
occurred the maximum number of times, along with its weight, in the reviews for a particular clothing
ID.
matrix into training and testing data in the ratio of 80:20 using test_size parameter.
• Now we decide the classification model that we wish to use to train our prediction model.
word/token (random variable):
Where:
P(c) is the prior probability of class c.
P(w|c) is the probability of word w occurring given class c.
The probability of class c given document d is proportional to the prior multiplied by P(w|c) for each word w in d: $P(c \mid d) \propto P(c) \prod_{w \in d} P(w \mid c)$
library.
• Our ultimate goal is to train our model to learn the probabilities needed in order to make a
classification decision. We achieve this by training the model by giving it the training set of
data.
• Now we use the prediction model that we obtained after training, to compute further
predictions. We pass the test data to the predict function which gives us the prediction
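To make the training and prediction steps concrete, here is a small from-scratch sketch of a Multinomial Naive Bayes classifier that scores a class by its prior times P(w|c) for each word. It is not the library implementation used in the project: the class name, the add-one (Laplace) smoothing, the choice to skip words never seen in training, and the toy reviews are assumptions of this sketch. Logarithms are summed instead of multiplying probabilities to avoid numerical underflow on long reviews.

```python
import math
from collections import Counter, defaultdict


class MultinomialNaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.class_doc_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)  # class -> word -> count
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        self.total_docs = len(labels)
        return self

    def log_posterior(self, tokens, label):
        """log P(c) plus the sum of log P(w|c): the unnormalised log of P(c|d)."""
        log_prob = math.log(self.class_doc_counts[label] / self.total_docs)
        denominator = sum(self.word_counts[label].values()) + len(self.vocabulary)
        for token in tokens:
            if token not in self.vocabulary:
                continue  # assumption: words never seen in training carry no evidence
            log_prob += math.log((self.word_counts[label][token] + 1) / denominator)
        return log_prob

    def predict(self, tokens):
        """Return the class with the highest unnormalised log posterior."""
        return max(self.class_doc_counts, key=lambda c: self.log_posterior(tokens, c))


# Toy training set, made up for illustration (1 = recommended, 0 = not recommended).
train_docs = [
    ["love", "dress"],
    ["great", "fit", "love"],
    ["poor", "fabric"],
    ["poor", "fit", "returned"],
]
train_labels = [1, 1, 0, 0]

model = MultinomialNaiveBayes().fit(train_docs, train_labels)
print(model.predict(["love", "fit"]))
print(model.predict(["poor", "fabric"]))
```

The same class can be trained on rating values instead of the recommended indicator, since nothing in it assumes only two classes.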
The histogram is drawn for age vs. number of reviews. With its help we realized that maximum
This bar graph is drawn for category vs. number of reviews helped us to realize that dresses are
This bar graph of clothing ID vs. popularity helped us realize that clothing
ID 1078 has the most reviews, making it the most popular item.
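The counts behind such graphs can be computed as in the sketch below, which tallies reviews per value of a column and prints a text bar for each. The sample rows, the column keys, and the helper names count_reviews and text_bar_chart are made up for illustration; in the project the same counts would be plotted with Matplotlib rather than printed.

```python
from collections import Counter

# Made-up sample rows for illustration only; the real rows come from the dataset.
rows = [
    {"clothing_id": 11, "class_name": "Dresses"},
    {"clothing_id": 12, "class_name": "Knits"},
    {"clothing_id": 11, "class_name": "Dresses"},
    {"clothing_id": 13, "class_name": "Dresses"},
    {"clothing_id": 11, "class_name": "Dresses"},
    {"clothing_id": 12, "class_name": "Knits"},
]


def count_reviews(rows, column):
    """Count how many reviews share each value of the given column."""
    return Counter(row[column] for row in rows)


def text_bar_chart(counts):
    """Render counts as text bars, most frequent value first."""
    return [f"{value}: {'#' * n} ({n})" for value, n in counts.most_common()]


id_counts = count_reviews(rows, "clothing_id")
for line in text_bar_chart(id_counts):
    print(line)
```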
7.1 CONCLUSION
Through this project we were able to explore the vast libraries and modules available for NLP and
also understand and implement a few machine learning algorithms. Sentiment analysis helped us
to provide users with the best recommended clothing IDs. Also, the Multinomial Naïve Bayes
algorithm helped us to provide retailers with an easy and efficient way to know whether users
have genuine interest in their products, by providing them with true ratings and Recommended
IND values. Thus, we were able to achieve our goals of improving the user experience and retailers' service.
• Increasing the capacity of training set and hence increasing the efficiency of the prediction