A PROJECT REPORT
Submitted by
SAHANA J M (113219071033)
BHOOMIHA M (113219071003)
HARIPRIYA P (113219071012)
BACHELOR OF TECHNOLOGY IN
ARTIFICIAL INTELLIGENCE AND DATA SCIENCE
BONAFIDE CERTIFICATE
Certified that this project report titled “SENTIMENT ANALYSIS ON RESTAURANT REVIEWS”
is the bonafide work of Ms. SAHANA J M (113219071033), Ms. BHOOMIHA M (113219071003),
Ms. HARIPRIYA P (113219071012), who carried out the Mini project work under my supervision.
SIGNATURE SIGNATURE
MINI PROJECT EXAMINATION
Of Third year Bachelor of Technology in Artificial Intelligence and Data Science submitted for the
INTERNAL EXAMINER
ABSTRACT
ACKNOWLEDGEMENT
TABLE OF CONTENTS
ACKNOWLEDGEMENT
TABLE OF CONTENTS
1 INTRODUCTION
  1.1 PROJECT OUTLINE
  TOOLS/PLATFORM
  INTRODUCTION
  MOTIVATION
  PROBLEMS
3 SOFTWARES USED
  LIBRARIES/MODULES USED
  TOOLS USED
    3.2.1 FROM NLTK
    3.2.2 FROM SCIKIT LEARN
    3.2.3 FROM MATPLOTLIB
4 MODULE IMPLEMENTATION
  4.1 DATA PRE-PROCESSING
    4.1.1 DATA CLEANING
    STOPWORDS
    STEMMING
  4.2 DATA TRANSFORMATION
    4.2.1 COUNT VECTORIZER
    4.2.2 CORPUS
    4.2.3 PICKLE
    4.2.4 BAG OF WORDS MODEL
  4.6.1 MATPLOTLIB
CHAPTER 1
INTRODUCTION
Project Outline:
Tools/Platform:
Introduction:
As the internet grows, its horizons become wider. Social media and
microblogging platforms dominate in spreading recommendations for places,
based on reviews, across the globe at a rapid pace. A topic becomes trending
when more and more users contribute their opinions and judgments, making
these platforms a valuable source of online perception. Large organizations
and firms take advantage of people's feedback to improve their products and
services, which further helps in enhancing their marketing strategies. Thus,
there is huge potential in discovering and analyzing insights from the vast
amount of social media data for business-driven applications.
Social media sites offer people a platform to voice their opinions. For
example, people quickly post their reviews online as soon as they have a meal
in a restaurant, and then start a series of comments to discuss the ambience
of the restaurant. This kind of information forms a basis for people to
evaluate and rate the performance of not only restaurants but other products
as well, and to judge whether they will be a success or not. Such vast
information on these sites can be used for marketing and social studies.
Studies that use customer reviews for this purpose are still rare. Generally,
restaurant customer satisfaction is analyzed through product data, nutrition
data and food preparation.
Since restaurant reviews are in the form of text, customer reviews fall into
the text mining category, and the results of these data will be classified
into two values, positive or negative. Retrieving and preprocessing the review
data, such as removing stop words and punctuation, is done with the help of
Python, while data is classified using the Waikato Environment for Knowledge
Analysis (WEKA) software with the Naive Bayes method and, for comparison,
TextBlob, a Python-based sentiment analyzer. Naive Bayes is chosen because
this method has been widely implemented in sentiment analysis.

Another aim is to find the best method for analyzing restaurant customer
review data by comparing the Naive Bayes method and TextBlob sentiment
analysis, since the two methods have fundamental differences in terms of
calculation.
MOTIVATION:
If you think that comments containing words like “good” and “awesome” can be
classified as positive comments and comments containing words like “bad” and
“miserable” can be classified as negative comments, think again. For example,
“Completely lacking in good taste” and “Good for a quick meal but nothing
special” represent negative and neutral feedback respectively, even though
both contain the word “good”. Therefore, as mentioned, the task may not be as
easy as it may seem. Let’s move on to the data we will be working with to
solve the problem.
PROBLEMS
For years, food and hospitality businesses have run on the assumption that
good food and service is the way to attract more customers. But the advent of
science and technology, and more importantly the data created by the use of
online platforms, has pointed towards new findings and opened new doors: most
consumers nowadays rate a product online, over a third of them write reviews,
and nearly 88% of people trust online reviews. Review services like Yelp,
Google Reviews, etc. provide customers and businesses a way to interact with
one another.
The main objective of the work proposed in this report is to enhance the user
experience by analyzing restaurant reviews and categorizing them along certain
aspects, so that a user can easily learn about a restaurant. Restaurants are
currently not able to utilize reviews for their benefit. We want to use the
aspects that are important in the food and service industry so that we can
analyze the sentiment of text reviews and help restaurants improve their
businesses.
CHAPTER 2
LITERATURE SURVEY
When diving into the literature, we found that a large amount of relevant work
has already been done in this field, but most of it was not industry oriented.
We tried to incorporate the most noticeable findings in these works as our
base so we could build upon the work already done. Most of the work focuses on
improving the models for classification. Fakeness of reviews is a common
problem that arises. Some deep learning techniques have also been compared
with classical techniques. Some of these works are summarized in the
subsequent sections.
Customer satisfaction is an essential concern in the field of marketing and in
research on consumer behavior. For instance, when hotel consumers receive
excellent service, they transmit it to others by word of mouth. Text mining,
the retrieval of data from a collection of stored documents, is frequently
done with the help of analysis tools or manually. Through the analysis of
several text mining perspectives, information can be produced that can be used
to increase profits and services. Sentiment analysis is used to find the
author's opinion about a specified entity. Sentiment analysis of a review is
an opinion investigation of a product. The basis of sentiment analysis is
using Natural Language Processing (NLP), text analysis and some computational
steps to extract or omit unnecessary parts and to see whether the pattern of
the sentence is negative or positive.
In the 18th century, Reverend Thomas Bayes developed the probability theorem
on which the method now known as Naive Bayes is based. Naive Bayes calculates
the probability of future predictions from the data or experience it has been
given, from this probabilistic point of view. One characteristic of the Naive
Bayes classifier is its independence assumption on the input variables: the
presence of a particular feature in a class is assumed to be independent of
the other features.
The phrases and expressions captured by n-grams give them an edge over other
techniques on a technical set of challenges. The results showed the
effectiveness of addressing sentiment analysis challenges for improving the
accuracy of the model.
Assessing the Helpfulness of Online Hotel Reviews:
By using feature engineering, it can help optimize the cost of search for most
consumers.
This work gives an idea of the challenges that we could face in our research.
Given the significance of online feedback for different types of industries,
and the difficulty involved in procuring and maintaining a favorable
reputation on the Internet, diverse methods have been used to enhance digital
presence, including unethical practices.

Fake reviews are one of the most preferred unethical methods found on such
sites. In response, the Fake Feature Framework (F3) helps to assemble and
organize features for fake review detection.
Another approach considers the semantic orientation of phrases in the review
that comprise adverbs and adjectives. It is expected that merging semantic
orientation with sentiments yields a more effective result. The review is
recommended only if the mean orientation is positive, and otherwise it is not.
The Naive Bayes model generally performs better than SVM.
Proposed System:
FLOW DIAGRAM:
CHAPTER 3
SOFTWARES USED
LIBRARIES/MODULES USED:
TOOLS USED:
FROM NLTK:
Corpus
Stop words
Porter Stemmer
Bag of Words (BoW)
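As a minimal sketch, these tools map to the following imports, assuming NLTK
for the corpus tools and scikit-learn's CountVectorizer for the Bag of Words
features:

import nltk
nltk.download('stopwords')  # one-time download of the stop word corpus

from nltk.corpus import stopwords                             # stop word lists
from nltk.stem.porter import PorterStemmer                    # Porter stemming algorithm
from sklearn.feature_extraction.text import CountVectorizer  # Bag of Words (BoW) features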
CHAPTER 4
MODULE IMPLEMENTATION
The main approach involved in this project comprises various data
pre-processing steps, feature extraction and a machine learning classifier.
The main machine learning algorithm used is Naïve Bayes. The main data
pre-processing methods are stop word removal and stemming.
STEPS:
1. Importing Dataset
2. Data Preprocessing
3. Data Transformation
4. Dividing Dataset into Training Set and Test Set
5. Model Training
6. Checking Performance
7. Importing fresh Dataset
8. Data Preprocessing
9. Data Transformation
10. Predictions
11. Data Visualization based on Prediction (Pie-Chart)
DATA PRE-PROCESSING:
DATA CLEANING:
While it may seem like an easy task when we manually edit a couple of hundred
comments, for example for a TikTok insight, when we have to assess numerous
videos, say when doing an Instagram analysis, such scenarios mean videos with
an aggregated sum of comments running into the thousands. In such a case, we
need an automated sentiment analysis tool. However, for that tool to give
accurate and high-precision results, we need to make sure that we have a
high-quality dataset prepped for analysis.
Stopwords
Porter Stemmer
STOPWORDS:
The words which are generally filtered out before processing a natural
language are called stop words. These are actually the most common words in
any language (like articles, prepositions, pronouns, conjunctions, etc.) and
do not add much information to the text. Examples of a few stop words in
English are “the”, “a”, “an”, “so”, and “what”.
Removal of stop words definitely reduces the dataset size and thus reduces the
training time, due to the smaller number of tokens involved in the training.
Do we always remove stop words? Are they always useless for us?
The answer is no!
We do not always remove the stop words. The removal of stop words is
highly dependent on the task we are performing and the goal we want to achieve.
For example, if we are training a model that can perform the sentiment analysis
task, we might not remove the stop words.
For example, consider the movie review “The movie was not good”. After stop
word removal (NLTK’s English stop word list includes “not”), it becomes “movie
good”.
We can clearly see that the review for the movie was negative. However,
after the removal of stop words, the review became positive, which is not the
reality. Thus, the removal of stop words can be problematic here.
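A minimal sketch of this pitfall, assuming NLTK's English stop word list
(which includes "not") has been downloaded:

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
review = "the movie was not good"
filtered = [word for word in review.split() if word not in stop_words]
print(filtered)  # ['movie', 'good'] -- the negation has disappeared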
Tasks like text classification do not generally need stop words as the other
words present in the dataset are more important and give the general idea of the
text. So, we generally remove stop words in such tasks.
In a nutshell, NLP has a lot of tasks that cannot be accomplished properly after
the removal of stop words. So, think before performing this step. The catch here
is that no rule is universal and no stop words list is universal. A list not
conveying any important information to one task can convey a lot of information
to the other task.
STEMMING:
Stemming is the process of reducing morphological variants of a word to a
common root/base form. Stemming programs are commonly referred to as stemming
algorithms or stemmers. A stemming algorithm reduces the words “chocolates”,
“chocolatey” and “choco” to the root word “chocolate”, and “retrieval”,
“retrieved” and “retrieves” to the stem “retrieve”. Stemming is an important
part of the pipelining process in natural language processing. The input to
the stemmer is tokenized words. How do we get these tokenized words?
Tokenization involves breaking down the document into individual words.
We’ll use the Porter stemmer here because the rules associated with suffix
removal are much less complex in the case of Porter’s stemmer, and it uses a
single, unified approach to the handling of context.
Example: EED -> EE means “if the word has at least one vowel and consonant
plus EED ending, change the ending to EE” as ‘agreed’ becomes ‘agree’.
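A minimal sketch of Porter stemming with NLTK; note that the produced stems
are not always dictionary words, and that all three "retriev*" variants
collapse to a single stem:

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["retrieval", "retrieved", "retrieves", "running", "agreed"]:
    print(word, "->", stemmer.stem(word))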
DATA TRANSFORMATION:
COUNT VECTORIZER:
In the example (see the sketch below), there are 8 unique words in the text
and hence 8 different columns, each representing a unique word in the matrix.
The row holds the word counts. Since the words ‘is’ and ‘my’ were repeated
twice, the count for those particular words is 2, and 1 for the rest.
CountVectorizer makes it easy for text data to be used directly in machine
learning and deep learning models such as text classification.
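As a stand-in illustration (the example sentence below is an assumption,
chosen to have 8 unique words, with "is" and "my" occurring twice):

from sklearn.feature_extraction.text import CountVectorizer

text = ["Data science is my passion and my hobby is coding"]  # hypothetical stand-in
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(text)
print(vectorizer.get_feature_names_out())  # the 8 unique words, in alphabetical order
print(matrix.toarray())                    # counts: 2 for 'is' and 'my', 1 for the rest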
CORPUS:
A corpus is a large, structured collection of texts used in language
processing; in sentiment analysis, an annotated corpus can record the
positive, negative and neutral sentiments for a document.
USE OF CORPUS
Corpora are essential in particular for the study of spoken and signed
language: while written language can be studied by examining the text, speech,
signs and gestures disappear once they have been produced, and thus we need
multimodal corpora in order to study interactive face-to-face communication.
EXAMPLE
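In this project the corpus is the list of cleaned review strings. A minimal
sketch of building it, where the two sample reviews and the 'Review' column
name are illustrative assumptions:

import re
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Two illustrative rows standing in for the real dataset
dataset = pd.DataFrame({'Review': ["Wow... Loved this place.", "Crust is not good."]})

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
stop_words.discard('not')  # keep negations, as discussed in the stop words section

corpus = []
for review in dataset['Review']:
    review = re.sub('[^a-zA-Z]', ' ', review).lower().split()  # keep letters only, then tokenize
    review = [stemmer.stem(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))
print(corpus)  # ['wow love place', 'crust not good']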
PICKLE:
Why pickle? In real-world scenarios, pickling and unpickling are widely used,
as they allow us to easily transfer data from one server/system to another and
then store it in a file or database.
Pickle a simple list: Pickle_list1.py

import pickle

mylist = ['a', 'b', 'c', 'd']
with open('datafile.txt', 'wb') as fh:
    pickle.dump(mylist, fh)  # serialize the list to the file as a byte stream
In the above code, the list “mylist” contains four elements (‘a’, ‘b’, ‘c’,
‘d’). We open the file in “wb” mode instead of “w” because all the operations
are done using bytes. A new file named “datafile.txt” is created in the
current working directory, and the mylist data is written to it as a byte
stream.
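Conversely, a minimal sketch of reading the list back (unpickling) from the
same file:

import pickle

with open('datafile.txt', 'rb') as fh:  # 'rb' because pickled data is a byte stream
    restored = pickle.load(fh)
print(restored)  # ['a', 'b', 'c', 'd']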
What is a Bag-of-Words?
A bag-of-words model, or BoW for short, is a way of extracting features
from text for use in modelling, such as with machine learning algorithms.
The approach is very simple and flexible, and can be used in a myriad of
ways for extracting features from documents.
A bag-of-words is a representation of text that describes the occurrence of
words within a document. It involves two things: a vocabulary of known words,
and a measure of the presence of those known words.
DIVIDING DATASET:
Train test split is a model validation procedure that allows you to simulate
how a model would perform on new/unseen data. Its main function is to split
arrays or matrices into random train and test subsets. Here is how the
procedure works.

Step 1. Make sure your data is arranged in a format acceptable for train test
split. In scikit-learn, this consists of separating your full dataset into
features and target.

Step 2. Split the dataset into two pieces: a training set and a testing set.
This consists of randomly selecting about 75% of the rows (you can vary this)
and putting them into your training set, and putting the remaining 25% into
your test set (“X_train”, “X_test”, “y_train”, “y_test”).

Step 3. Train the model on the training set (“X_train” and “y_train”).

Step 4. Test the model on the testing set (“X_test” and “y_test”) and evaluate
the performance. A sketch of this procedure follows.
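A minimal sketch with scikit-learn's train_test_split; the feature matrix and
target below are stand-ins:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)  # stand-in feature matrix: 20 samples, 2 features
y = np.array([0, 1] * 10)         # stand-in binary target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)  # 75% train / 25% test
print(X_train.shape, X_test.shape)         # (15, 2) (5, 2)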
MODEL FITTING:
NAÏVE BAYES:
Naive Bayes is among the simplest and fastest classification algorithms for
large chunks of data. The Naive Bayes classifier is used successfully in
various applications such as spam filtering, text classification, sentiment
analysis and recommendation systems. It uses the Bayes probability theorem to
predict unknown classes.

Simple Bayes and independent Bayes are other names for naive Bayes models. All
of these terms refer to the classifier’s decision rule being based on Bayes’
theorem. In practice, the Naive Bayes classifier applies Bayes’ theorem,
bringing its power to machine learning.
BAYES THEOREM:
Using Bayes’ theorem, we can find the probability of class A given the
features B:

P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}

The class that yields the maximum probability given the features is our
desired result.
In this experiment, the Naive Bayes classifier from NLTK was used to train and
test the data.
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
where y is the class variable and x = (x_1, ..., x_n) is the dependent feature
vector of size n. Note that P(y) is also called the class probability and
P(x_i | y) the conditional probability.
The different naive Bayes classifiers differ mainly by the assumptions they
make regarding the distribution of P(xi | y).
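As an illustrative sketch, Bag of Words counts can feed a multinomial Naive
Bayes classifier. The toy reviews and labels are assumptions, and
scikit-learn's MultinomialNB stands in for the classifier (the experiment
above mentions NLTK's):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["loved the food", "service was awful", "great place", "not good at all"]  # toy reviews
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # Bag of Words count features

clf = MultinomialNB()
clf.fit(X, labels)  # estimates P(y) and P(x_i | y) from the counts
print(clf.predict(vectorizer.transform(["the food was great"])))  # [1]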
TYPES OF NAÏVE BAYES MODEL:
There are three types of Naive Bayes model, which are given below:

1. Gaussian Naive Bayes: assumes the features follow a normal distribution.
2. Multinomial Naive Bayes: used when the features are discrete counts, as in
document classification.
3. Bernoulli Naive Bayes: used when the features are binary, i.e. a word is
either present or absent.
JOBLIB:
Joblib is a set of tools to provide lightweight pipelining in Python. In
particular, it offers:

• transparent disk-caching of functions and lazy re-evaluation (memoize
pattern)
• easy simple parallel computing

Joblib is optimized to be fast and robust on large data in particular, and has
specific optimizations for NumPy arrays. It is BSD-licensed.
Joblib is one of the Python libraries that provides an easy-to-use interface
for performing parallel programming in Python. The machine learning library
scikit-learn also uses joblib behind the scenes to run its algorithms in
parallel. Joblib is basically a wrapper library which uses other libraries for
running code in parallel. It also lets us choose between multi-threading and
multi-processing. Joblib is ideal for situations where you have a loop and
each iteration calls some function that takes time to complete. Such
functions, whose runs are independent of other runs of the same function in
the loop, are ideal candidates for parallelizing with joblib.
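A minimal sketch, adapted from joblib's canonical example, of parallelizing
independent loop iterations:

from math import sqrt
from joblib import Parallel, delayed

# Each iteration is independent of the others, so they can run in parallel.
results = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))
print(results)  # [0.0, 1.0, 2.0, ..., 9.0]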
MAIN FEATURES
CHECKING MODEL PERFORMANCE:
CONFUSION MATRIX:
You can compute the accuracy from the confusion matrix: the correct
predictions lie on its diagonal, so

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP, TN, FP and FN are the true positive, true negative, false positive
and false negative counts.
ACCURACY SCORE:
The confusion matrix (also called the error matrix) likewise tells you how
your system is performing; measures such as the F1-score can be derived from
it.
The accuracy score can be calculated using the formula:

Accuracy = (number of correct predictions) / (total number of predictions)

We can also calculate accuracy with the help of the accuracy_score method from
sklearn.
Syntax:

sklearn.metrics.accuracy_score(y_true, y_pred, normalize=True,
sample_weight=None)

Parameters:

y_true: the ground-truth labels.
y_pred: the labels predicted for each sample; a prediction counts as correct
only if it exactly matches the corresponding label in y_true.
normalize: if True, return the fraction of correctly classified samples; if
False, return their count.
sample_weight: optional per-sample weights.

Accuracy describes how the model performs across all classes. It is useful
when all the classes are equally important. The accuracy of the model is
calculated as the ratio between the number of correct predictions and the
total number of predictions.
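A minimal sketch of both metrics; the label arrays below are assumptions:

from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # actual labels (stand-in)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # predicted labels (stand-in)

print(confusion_matrix(y_true, y_pred))  # [[3 1] [1 3]]; rows are actual, columns predicted
print(accuracy_score(y_true, y_pred))    # 0.75, i.e. 6 of 8 predictions correct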
DATA VISUALIZATION:
Data visualization means that after data has been collected, processed and
modelled, it must be visualized for conclusions to be made. Data visualization
is also an element of the broader data presentation architecture (DPA)
discipline, which aims to identify, locate, manipulate, format and deliver
data in the most efficient way possible.
The visualization process generally involves four steps:

1. Load and prepare the datasets: Normally you will pick a dataset and
visualize its observations. But the dataset must be cleaned first: empty cells
must be filled, categorical variables changed to numeric if necessary, and
outliers sometimes detected. If you clean the dataset before visualization,
the result will be more trustworthy.

2. Import the libraries: Import the visualization libraries that will be used
for plotting.

3. Plot the graph: After importing the libraries you will set many
hyperparameters for size and display, pass the datasets which will be
visualized, and then plot the diagram with the proper syntax.
MATPLOTLIB:
Types of Plots:
1. bar: Make a bar plot.
2. barh: Make a horizontal bar plot.
3. boxplot: Make a box and whisker plot.
4. hist: Plot a histogram.
5. hist2d: Make a 2D histogram plot.
6. pie: Plot a pie chart.
7. plot: Plot lines and/or markers to the Axes.
8. polar: Make a polar plot.
9. scatter: Make a scatter plot of x vs y.
10. stackplot: Draw a stacked area plot.
11. stem: Create a stem plot.
12. step: Make a step plot.
13. quiver: Plot a 2-D field of arrows.
PIE-CHART:
A pie chart can only display one series of data. Pie charts show the size of
items (called wedges) in one data series, proportional to the sum of the
items. The data points in a pie chart are shown as a percentage of the whole
pie.
With Pyplot, you can use the pie() function to draw pie charts:

import matplotlib.pyplot as plt
import numpy as np

y = np.array([35, 25, 25, 15])  # example wedge sizes (assumed; the original values were not preserved)
plt.pie(y)
plt.show()
Basic Parameters:

y: 1D array-like
explode: array-like, default: None
labels: list, default: None
colors: array-like, default: None

Returns:

patches: list
texts: list
autotexts: list
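A minimal sketch of the project's final visualization, a pie chart of
predicted sentiments; the two counts are placeholders for what the model would
predict on the fresh dataset:

import matplotlib.pyplot as plt

sizes = [58, 42]                  # hypothetical counts of positive / negative predictions
labels = ['Positive', 'Negative']
plt.pie(sizes, labels=labels, autopct='%1.1f%%')  # autopct writes each wedge's percentage
plt.title('Predicted sentiment of restaurant reviews')
plt.show()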
CHAPTER 5
OUTPUT:
Here dataset.head() retrieves the first 5 rows from the given dataset.
CODING

### Importing libraries
import numpy as np
import pandas as pd

# Read the tab-separated reviews file; quoting = 3 (QUOTE_NONE) ignores double quotes
dataset = pd.read_csv('./drive/MyDrive/Sentiment_Analysis1/Project2_Sentiment_Analysis/a1_RestaurantReviews_HistoricDump.tsv',
                      delimiter = '\t', quoting = 3)

dataset.shape
dataset.head()
CHAPTER 6
CONCLUSION:
FUTURE ENHANCEMENT:
REFERENCES