A PROJECT REPORT
Submitted by
SAHANA J M (113219071033)
BHOOMIHA M (113219071003)
HARIPRIYA P (113219071012)
BACHELOR OF TECHNOLOGY IN
ARTIFICIAL INTELLIGENCE AND DATA SCIENCE
BONAFIDE CERTIFICATE
Certified that this project report titled “SENTIMENT ANALYSIS ON RESTAURANT REVIEWS”
is the bonafide work of Ms. SAHANA J M (113219071033), Ms. BHOOMIHA M (113219071003),
Ms. HARIPRIYA P (113219071012), who carried out the Mini project work under my supervision.
SIGNATURE SIGNATURE
MINI PROJECT EXAMINATION
Of Third year Bachelor of Technology in Artificial Intelligence and Data Science submitted for the
INTERNAL EXAMINER
ABSTRACT
ACKNOWLEDGEMENT
TABLE OF CONTENTS
ACKNOWLEDGEMENT
TABLE OF CONTENTS
1 INTRODUCTION
  1.1 PROJECT OUTLINE
  TOOLS/PLATFORM
  INTRODUCTION
  MOTIVATION
  PROBLEMS
3 SOFTWARES USED
  LIBRARIES/MODULES USED
  TOOLS USED
    3.2.1 FROM NLTK
    3.2.2 FROM SCIKIT LEARN
    3.2.3 FROM MATPLOTLIB
4 MODULE IMPLEMENTATION
  4.1 DATA PRE-PROCESSING
    4.1.1 DATA CLEANING
    STOPWORDS
    STEMMING
  4.2 DATA TRANSFORMATION
    4.2.1 COUNT VECTORIZER
    4.2.2 CORPUS
    4.2.3 PICKLE
    4.2.4 BAG OF WORDS MODEL
  4.6.1 MATPLOTLIB
CHAPTER 1
INTRODUCTION
Project Outline:
Tools/Platform:
Introduction:
As the internet grows, its horizons become wider. Social media and
microblogging platforms dominate in spreading recommendations for places,
based on reviews, across the globe at a rapid pace. A topic becomes trending
when more and more users contribute their opinions and judgments, making
these platforms a valuable source of online perception. Large organizations
and firms take advantage of people's feedback to improve their products and
services, which further helps in enhancing their marketing strategies. Thus,
there is huge potential in discovering and analyzing insights from the vast
amount of social media data for business-driven applications.
Social media sites offer people a platform to voice their opinions. For
example, people quickly post their reviews online as soon as they have a meal
in a restaurant, and then start a series of comments to discuss the ambience
of the restaurant. This kind of information forms a basis for people to
evaluate and rate the performance of not only restaurants but other products
as well, and to judge whether they will be a success or not. Such vast
information on these sites can be used for marketing and social studies.
Studies that use customer reviews for this purpose are still rare. Generally,
restaurant customer satisfaction is analyzed through product data, nutrition
data and food preparation.
Since restaurant reviews are in the form of text, customer reviews fall into
the text mining category, and the results of these data will be classified
into two values, positive or negative. Retrieving and preprocessing the review
data, such as removing stop words and punctuation, is done with the help of
Python, while data is classified using the Waikato Environment for Knowledge
Analysis (WEKA) software with the Naive Bayes method and, for comparison,
TextBlob, a Python-based sentiment analyzer. Naive Bayes is chosen because
this method has been widely implemented in sentiment analysis.

Another aim is to find the best method for analyzing restaurant customer
review data by comparing the Naive Bayes method and TextBlob sentiment
analysis, since the two methods have fundamental differences in terms of
calculation.
MOTIVATION:
If you think that comments containing words like “good” and “awesome” can be
classified as positive comments and comments containing words like “bad” and
“miserable” can be classified as negative comments, think again. For example,
“Completely lacking in good taste” and “Good for a quick meal but nothing
special” represent negative and neutral feedback respectively, even though
both contain the word “good”. Therefore, as mentioned, the task may not be as
easy as it may seem. Let’s move on to the data we will be working with to
solve the problem.
PROBLEMS
For years, food and hospitality businesses have run on the assumption that
good food and service is the way to attract more customers. But the advent of
science and technology, and more importantly the data created by the use of
online platforms, has pointed towards new findings and opened new doors: most
consumers nowadays rate a product online, over a third of them write reviews,
and nearly 88% of people trust online reviews. Review services like Yelp,
Google Reviews, etc. provide customers and businesses a way to interact with
one another.
The main objective of the work proposed in this report is to enhance the user
experience by analyzing restaurant reviews and categorizing them along certain
aspects, so that a user can easily learn about a restaurant. Restaurants are
currently not able to utilize reviews for their benefit. We want to use the
aspects that are important in the food and service industry so that we can
analyze the sentiment of text reviews and help restaurants improve their
businesses.
CHAPTER 2
LITERATURE SURVEY
When diving into the literature, we found that a large amount of relevant work
has already been done in this field, but most of it was not industry oriented.
We tried to incorporate the most noticeable findings in these works as our
base so we could build upon the work already done. Most of the work focuses on
improving the models for classification. Fakeness of reviews is a common
problem that arises. Some deep learning techniques have also been compared
with classical techniques. Some of these works are summarized in the
subsequent sections.
Customer satisfaction is an essential concern in the field of marketing and in
research on consumer behavior. For instance, when hotel consumers receive
excellent service, they transmit it to others by word of mouth. Text mining,
the retrieval of data from a collection of stored documents, is frequently
done with the help of analysis tools or manually. Through the analysis of
several text mining perspectives, information can be produced that can be used
to increase profits and services. Sentiment analysis is used to find the
author's opinion about a specified entity. Sentiment analysis of a review is
an opinion investigation of a product. The basis of sentiment analysis is
using Natural Language Processing (NLP), text analysis and some computational
steps to extract or omit unnecessary parts and to see whether the pattern of
the sentence is negative or positive.
In the 18th century, Reverend Thomas Bayes developed the probability theorem
on which the method now known as Naive Bayes is based. Naive Bayes calculates
the probability of future predictions from the data or experience it has been
given, from this probabilistic point of view. One characteristic of the Naive
Bayes classifier is its independence assumption on the input variables: the
presence of a particular feature in a class is assumed to be independent of
the other features.
The phrases and expressions captured by n-grams give them an edge over other
techniques on a technical set of challenges. The results showed the
effectiveness of addressing sentiment analysis challenges for improving the
accuracy of the model.
Assessing the Helpfulness of Online Hotel Reviews:
By using feature engineering, it can help optimize the cost of search for most
consumers.
This work gives an idea of the challenges that we could face in our research.
Given the significance of online feedback for different types of industries,
and the difficulty involved in procuring and maintaining a favorable
reputation on the Internet, diverse methods have been used to enhance digital
presence, including unethical practices.

Fake reviews are one of the most preferred unethical methods found on such
sites. In response, the Fake Feature Framework (F3) helps to assemble and
organize features for fake review detection.
Another approach considers the semantic orientation of phrases in the review
that comprise adverbs and adjectives. It is expected that merging semantic
orientation with sentiments yields a more effective result. The review is
recommended only if the mean orientation is positive, and otherwise it is not.
The Naive Bayes model generally performs better than SVM.
Proposed System:
FLOW DIAGRAM:
CHAPTER 3
SOFTWARES USED
LIBRARIES/MODULES USED:
TOOLS USED:
FROM NLTK:
Corpus
Stop words
Porter Stemmer
Bag of Words (BoW)
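As a minimal sketch, these tools map to the following imports, assuming NLTK
for the corpus tools and scikit-learn's CountVectorizer for the Bag of Words
features:

import nltk
nltk.download('stopwords')  # one-time download of the stop word corpus

from nltk.corpus import stopwords                             # stop word lists
from nltk.stem.porter import PorterStemmer                    # Porter stemming algorithm
from sklearn.feature_extraction.text import CountVectorizer  # Bag of Words (BoW) features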
CHAPTER 4
MODULE IMPLEMENTATION
The main approach involved in this project comprises various data
pre-processing steps, feature extraction and a machine learning classifier.
The main machine learning algorithm used is Naïve Bayes. The main data
pre-processing methods are stop word removal and stemming.
STEPS:
1. Importing Dataset
2. Data Preprocessing
3. Data Transformation
4. Dividing Dataset into Training Set and Test Set
5. Model Training
6. Checking Performance
7. Importing fresh Dataset
8. Data Preprocessing
9. Data Transformation
10. Predictions
11. Data Visualization based on Prediction (Pie-Chart)
DATA PRE-PROCESSING:
DATA CLEANING:
While it may seem like an easy task when we manually edit a couple of hundred
comments, for example for a TikTok insight, when we have to assess numerous
videos, say when doing an Instagram analysis, such scenarios mean videos with
an aggregated sum of comments running into the thousands. In such a case, we
need an automated sentiment analysis tool. However, for that tool to give
accurate and high-precision results, we need to make sure that we have a
high-quality dataset prepped for analysis.
Stopwords
Porter Stemmer
STOPWORDS:
The words which are generally filtered out before processing a natural
language are called stop words. These are actually the most common words in
any language (like articles, prepositions, pronouns, conjunctions, etc.) and
do not add much information to the text. Examples of a few stop words in
English are “the”, “a”, “an”, “so”, and “what”.
Removal of stop words definitely reduces the dataset size and thus reduces the
training time, due to the smaller number of tokens involved in the training.
Do we always remove stop words? Are they always useless for us?
The answer is no!
We do not always remove the stop words. The removal of stop words is
highly dependent on the task we are performing and the goal we want to achieve.
For example, if we are training a model that can perform the sentiment analysis
task, we might not remove the stop words.
For example, consider the movie review “The movie was not good”. After stop
word removal (NLTK’s English stop word list includes “not”), it becomes “movie
good”.
We can clearly see that the review for the movie was negative. However,
after the removal of stop words, the review became positive, which is not the
reality. Thus, the removal of stop words can be problematic here.
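A minimal sketch of this pitfall, assuming NLTK's English stop word list
(which includes "not") has been downloaded:

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
review = "the movie was not good"
filtered = [word for word in review.split() if word not in stop_words]
print(filtered)  # ['movie', 'good'] -- the negation has disappeared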
Tasks like text classification do not generally need stop words as the other
words present in the dataset are more important and give the general idea of the
text. So, we generally remove stop words in such tasks.
In a nutshell, NLP has a lot of tasks that cannot be accomplished properly after
the removal of stop words. So, think before performing this step. The catch here
is that no rule is universal and no stop words list is universal. A list not
conveying any important information to one task can convey a lot of information
to the other task.
STEMMING:
Stemming is the process of reducing morphological variants of a word to a
common root/base form. Stemming programs are commonly referred to as stemming
algorithms or stemmers. A stemming algorithm reduces the words “chocolates”,
“chocolatey” and “choco” to the root word “chocolate”, and “retrieval”,
“retrieved” and “retrieves” to the stem “retrieve”. Stemming is an important
part of the pipelining process in natural language processing. The input to
the stemmer is tokenized words. How do we get these tokenized words?
Tokenization involves breaking down the document into individual words.
We’ll use the Porter stemmer here because the rules associated with suffix
removal are much less complex in the case of Porter’s stemmer, and it uses a
single, unified approach to the handling of context.
Example: EED -> EE means “if the word has at least one vowel and consonant
plus EED ending, change the ending to EE” as ‘agreed’ becomes ‘agree’.
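A minimal sketch of Porter stemming with NLTK; note that the produced stems
are not always dictionary words, and that all three "retriev*" variants
collapse to a single stem:

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["retrieval", "retrieved", "retrieves", "running", "agreed"]:
    print(word, "->", stemmer.stem(word))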
DATA TRANSFORMATION:
COUNT VECTORIZER:
In the example (see the sketch below), there are 8 unique words in the text
and hence 8 different columns, each representing a unique word in the matrix.
The row holds the word counts. Since the words ‘is’ and ‘my’ were repeated
twice, the count for those particular words is 2, and 1 for the rest.
CountVectorizer makes it easy for text data to be used directly in machine
learning and deep learning models such as text classification.
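As a stand-in illustration (the example sentence below is an assumption,
chosen to have 8 unique words, with "is" and "my" occurring twice):

from sklearn.feature_extraction.text import CountVectorizer

text = ["Data science is my passion and my hobby is coding"]  # hypothetical stand-in
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(text)
print(vectorizer.get_feature_names_out())  # the 8 unique words, in alphabetical order
print(matrix.toarray())                    # counts: 2 for 'is' and 'my', 1 for the rest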
CORPUS:
A corpus is a large, structured collection of texts used in language
processing; in sentiment analysis, an annotated corpus can record the
positive, negative and neutral sentiments for a document.
USE OF CORPUS
Corpora are essential in particular for the study of spoken and signed
language: while written language can be studied by examining the text, speech,
signs and gestures disappear once they have been produced, and thus we need
multimodal corpora in order to study interactive face-to-face communication.
EXAMPLE
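In this project the corpus is the list of cleaned review strings. A minimal
sketch of building it, where the two sample reviews and the 'Review' column
name are illustrative assumptions:

import re
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Two illustrative rows standing in for the real dataset
dataset = pd.DataFrame({'Review': ["Wow... Loved this place.", "Crust is not good."]})

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
stop_words.discard('not')  # keep negations, as discussed in the stop words section

corpus = []
for review in dataset['Review']:
    review = re.sub('[^a-zA-Z]', ' ', review).lower().split()  # keep letters only, then tokenize
    review = [stemmer.stem(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))
print(corpus)  # ['wow love place', 'crust not good']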
PICKLE:
Why pickle? In real-world scenarios, pickling and unpickling are widely used,
as they allow us to easily transfer data from one server/system to another and
then store it in a file or database.
Pickle a simple list: Pickle_list1.py

import pickle

mylist = ['a', 'b', 'c', 'd']
with open('datafile.txt', 'wb') as fh:
    pickle.dump(mylist, fh)  # serialize the list to the file as a byte stream
In the above code, the list “mylist” contains four elements (‘a’, ‘b’, ‘c’,
‘d’). We open the file in “wb” mode instead of “w” because all the operations
are done using bytes. A new file named “datafile.txt” is created in the
current working directory, and the mylist data is written to it as a byte
stream.
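Conversely, a minimal sketch of reading the list back (unpickling) from the
same file:

import pickle

with open('datafile.txt', 'rb') as fh:  # 'rb' because pickled data is a byte stream
    restored = pickle.load(fh)
print(restored)  # ['a', 'b', 'c', 'd']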
What is a Bag-of-Words?
A bag-of-words model, or BoW for short, is a way of extracting features
from text for use in modelling, such as with machine learning algorithms.
The approach is very simple and flexible, and can be used in a myriad of
ways for extracting features from documents.
A bag-of-words is a representation of text that describes the occurrence of
words within a document. It involves two things: a vocabulary of known words,
and a measure of the presence of those known words.
DIVIDING DATASET:
Train test split is a model validation procedure that allows you to simulate
how a model would perform on new/unseen data. Its main function is to split
arrays or matrices into random train and test subsets. Here is how the
procedure works.

Step 1. Make sure your data is arranged in a format acceptable for train test
split. In scikit-learn, this consists of separating your full dataset into
features and target.

Step 2. Split the dataset into two pieces: a training set and a testing set.
This consists of randomly selecting about 75% of the rows (you can vary this)
and putting them into your training set, and putting the remaining 25% into
your test set (“X_train”, “X_test”, “y_train”, “y_test”).

Step 3. Train the model on the training set (“X_train” and “y_train”).

Step 4. Test the model on the testing set (“X_test” and “y_test”) and evaluate
the performance. A sketch of this procedure follows.
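A minimal sketch with scikit-learn's train_test_split; the feature matrix and
target below are stand-ins:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)  # stand-in feature matrix: 20 samples, 2 features
y = np.array([0, 1] * 10)         # stand-in binary target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)  # 75% train / 25% test
print(X_train.shape, X_test.shape)         # (15, 2) (5, 2)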
MODEL FITTING:
NAÏVE BAYES:
Naive Bayes is among the simplest and fastest classification algorithms for
large chunks of data. The Naive Bayes classifier is used successfully in
various applications such as spam filtering, text classification, sentiment
analysis and recommendation systems. It uses the Bayes probability theorem to
predict unknown classes.

Simple Bayes and independent Bayes are other names for naive Bayes models. All
of these terms refer to the classifier’s decision rule being based on Bayes’
theorem. In practice, the Naive Bayes classifier applies Bayes’ theorem,
bringing its power to machine learning.
BAYES THEOREM:
Using Bayes’ theorem, we can find the probability of class A given the
features B:

P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}

The class that yields the maximum probability given the features is our
desired result.
In this experiment, the Naive Bayes classifier from NLTK was used to train and
test the data.
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
where y is the class variable and x = (x_1, ..., x_n) is the dependent feature
vector of size n. Note that P(y) is also called the class probability and
P(x_i | y) the conditional probability.
The different naive Bayes classifiers differ mainly by the assumptions they
make regarding the distribution of P(xi | y).
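As an illustrative sketch, Bag of Words counts can feed a multinomial Naive
Bayes classifier. The toy reviews and labels are assumptions, and
scikit-learn's MultinomialNB stands in for the classifier (the experiment
above mentions NLTK's):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["loved the food", "service was awful", "great place", "not good at all"]  # toy reviews
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # Bag of Words count features

clf = MultinomialNB()
clf.fit(X, labels)  # estimates P(y) and P(x_i | y) from the counts
print(clf.predict(vectorizer.transform(["the food was great"])))  # [1]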
TYPES OF NAÏVE BAYES MODEL:
There are three types of Naive Bayes model, which are given below:

1. Gaussian Naive Bayes: assumes the features follow a normal distribution.
2. Multinomial Naive Bayes: used when the features are discrete counts, as in
document classification.
3. Bernoulli Naive Bayes: used when the features are binary, i.e. a word is
either present or absent.
JOBLIB:
Joblib is a set of tools to provide lightweight pipelining in Python. In
particular, it offers:

• transparent disk-caching of functions and lazy re-evaluation (memoize
pattern)
• easy simple parallel computing

Joblib is optimized to be fast and robust on large data in particular, and has
specific optimizations for NumPy arrays. It is BSD-licensed.
Joblib is one of the Python libraries that provides an easy-to-use interface
for performing parallel programming in Python. The machine learning library
scikit-learn also uses joblib behind the scenes to run its algorithms in
parallel. Joblib is basically a wrapper library which uses other libraries for
running code in parallel. It also lets us choose between multi-threading and
multi-processing. Joblib is ideal for situations where you have a loop and
each iteration calls some function that takes time to complete. Such
functions, whose runs are independent of other runs of the same function in
the loop, are ideal candidates for parallelizing with joblib.
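A minimal sketch, adapted from joblib's canonical example, of parallelizing
independent loop iterations:

from math import sqrt
from joblib import Parallel, delayed

# Each iteration is independent of the others, so they can run in parallel.
results = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))
print(results)  # [0.0, 1.0, 2.0, ..., 9.0]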
MAIN FEATURES
CHECKING MODEL PERFORMANCE:
CONFUSION MATRIX:
You can compute the accuracy from the confusion matrix: the correct
predictions lie on its diagonal, so

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP, TN, FP and FN are the true positive, true negative, false positive
and false negative counts.
ACCURACY SCORE:
The confusion matrix (also called the error matrix) likewise tells you how
your system is performing; measures such as the F1-score can be derived from
it.
The accuracy score can be calculated using the formula:

Accuracy = (number of correct predictions) / (total number of predictions)

We can also calculate accuracy with the help of the accuracy_score method from
sklearn.
Syntax:

sklearn.metrics.accuracy_score(y_true, y_pred, normalize=True,
sample_weight=None)

Parameters:

y_true: the ground-truth labels.
y_pred: the labels predicted for each sample; a prediction counts as correct
only if it exactly matches the corresponding label in y_true.
normalize: if True, return the fraction of correctly classified samples; if
False, return their count.
sample_weight: optional per-sample weights.

Accuracy describes how the model performs across all classes. It is useful
when all the classes are equally important. The accuracy of the model is
calculated as the ratio between the number of correct predictions and the
total number of predictions.
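A minimal sketch of both metrics; the label arrays below are assumptions:

from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # actual labels (stand-in)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # predicted labels (stand-in)

print(confusion_matrix(y_true, y_pred))  # [[3 1] [1 3]]; rows are actual, columns predicted
print(accuracy_score(y_true, y_pred))    # 0.75, i.e. 6 of 8 predictions correct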
DATA VISUALIZATION:
Data visualization means that after data has been collected, processed and
modelled, it must be visualized for conclusions to be made. Data visualization
is also an element of the broader data presentation architecture (DPA)
discipline, which aims to identify, locate, manipulate, format and deliver
data in the most efficient way possible.
The visualization process generally involves four steps:

1. Load and prepare the datasets: Normally you will pick a dataset and
visualize its observations. But the dataset must be cleaned first: empty cells
must be filled, categorical variables changed to numeric if necessary, and
outliers sometimes detected. If you clean the dataset before visualization,
the result will be more trustworthy.

2. Import the libraries: Import the visualization libraries that will be used
for plotting.

3. Plot the graph: After importing the libraries you will set many
hyperparameters for size and display, pass the datasets which will be
visualized, and then plot the diagram with the proper syntax.
MATPLOTLIB:
Types of Plots:
1. bar: Make a bar plot.
2. barh: Make a horizontal bar plot.
3. boxplot: Make a box and whisker plot.
4. hist: Plot a histogram.
5. hist2d: Make a 2D histogram plot.
6. pie: Plot a pie chart.
7. plot: Plot lines and/or markers to the Axes.
8. polar: Make a polar plot.
9. scatter: Make a scatter plot of x vs y.
10. stackplot: Draw a stacked area plot.
11. stem: Create a stem plot.
12. step: Make a step plot.
13. quiver: Plot a 2-D field of arrows.
PIE-CHART:
A pie chart can only display one series of data. Pie charts show the size of
items (called wedges) in one data series, proportional to the sum of the
items. The data points in a pie chart are shown as a percentage of the whole
pie.
With Pyplot, you can use the pie() function to draw pie charts:

import matplotlib.pyplot as plt
import numpy as np

y = np.array([35, 25, 25, 15])  # example wedge sizes (assumed; the original values were not preserved)
plt.pie(y)
plt.show()
Basic Parameters:

y: 1D array-like
explode: array-like, default: None
labels: list, default: None
colors: array-like, default: None

Returns:

patches: list
texts: list
autotexts: list
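A minimal sketch of the project's final visualization, a pie chart of
predicted sentiments; the two counts are placeholders for what the model would
predict on the fresh dataset:

import matplotlib.pyplot as plt

sizes = [58, 42]                  # hypothetical counts of positive / negative predictions
labels = ['Positive', 'Negative']
plt.pie(sizes, labels=labels, autopct='%1.1f%%')  # autopct writes each wedge's percentage
plt.title('Predicted sentiment of restaurant reviews')
plt.show()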
CHAPTER 5
OUTPUT:
Here dataset.head() retrieves the first 5 rows from the given dataset.
CODING

### Importing libraries
import numpy as np
import pandas as pd

# Read the tab-separated reviews file; quoting = 3 (QUOTE_NONE) ignores double quotes
dataset = pd.read_csv('./drive/MyDrive/Sentiment_Analysis1/Project2_Sentiment_Analysis/a1_RestaurantReviews_HistoricDump.tsv',
                      delimiter = '\t', quoting = 3)

dataset.shape
dataset.head()
CHAPTER 6
CONCLUSION:
FUTURE ENHANCEMENT:
REFERENCES