FAKE REVIEWS DETECTION BASED ON INDEX OPTIMIZATION

A Main Project Report submitted to

JNTUA, Ananthapuramu

In partial fulfillment of the requirements for the award of the degree of

Bachelor of Technology

(INFORMATION TECHNOLOGY)
By
A. SANTHOSH (20KB1A1201) K. VENKATESWARLU (20KB1A1230)
V. NITHIN (20KB1A1259) Y. RAVI (20KB1A1261)
N. JAGADEESH (21KB5A1205)

Under the esteemed guidance of

Mr. M. SIVAPRATAP REDDY

Assistant Professor, Dept. of IT and AI&DS

DEPARTMENT OF IT and AI&DS

N.B.K.R INSTITUTE OF SCIENCE & TECHNOLOGY
(Autonomous)
VIDYANAGAR – 524 413, TIRUPATI DIST, AP
MAY 2024

1
Dept. of IT and AI&DS,NBKRIST
Website: www.nbkrist.org Email: ist@nbkrist.org
Ph: 08624-228 247 Fax: 08624-228 257

N.B.K.R. INSTITUTE OF SCIENCE & TECHNOLOGY

(Autonomous)
(Approved by AICTE: Accredited by NBA: Affiliated to JNTUA, Ananthapuramu)

An ISO 9001-2000 Certified Institution

Vidyanagar -524 413, Tirupati District, Andhra Pradesh, India

BONAFIDE CERTIFICATE
This is to certify that the Project work entitled “FAKE REVIEWS DETECTION BASED
ON INDEX OPTIMIZATION” is a bonafide work done by A. Santhosh (20KB1A1201),
K. Venkateswarlu (20KB1A1230), V. Nithin (20KB1A1259), Y. Ravi (20KB1A1261), and N.
Jagadeesh (21KB5A1205) in the Department of IT and AI&DS, N.B.K.R. Institute of
Science & Technology, Vidyanagar, and is submitted to JNTUA, Ananthapuramu in
partial fulfillment of the requirements for the award of the B.Tech degree in Information
Technology. This work has been carried out under my supervision.

Mr. M. SIVA PRATHAP REDDY Dr. A. NARAYANA RAO

Assistant Professor Professor & Head
Department of IT and AI&DS Department of IT and AI&DS
N.B.K.R.I.S.T, Vidyanagar N.B.K.R.I.S.T, Vidyanagar

Submitted for the Viva-Voce Examination held on ____________

Internal Examiner External Examiner

TABLE OF CONTENTS

CHAPTER NO CHAPTER NAME PAGE NO

BONAFIDE CERTIFICATE

LIST OF FIGURES

LIST OF TABLES

ACKNOWLEDGEMENT

ABSTRACT

1 INTRODUCTION

1.1 INTRODUCTION

1.2 MOTIVATION

1.3 PROBLEM STATEMENT

1.4 OBJECTIVES AND SCOPE

1.5 ORGANIZATION OF THE PROJECT

1.6 SUMMARY

2 LITERATURE SURVEY

2.1 EXISTING SYSTEM

2.2 SURVEY

2.3 PROPOSED SYSTEM

3 METHODOLOGY

3.1 INTRODUCTION

4 IMPLEMENTATION

4.1 INTRODUCTION

4.2 MODEL TRAINING

4.3 CREATE AN UI

4.4 REQUIREMENTS

5 RESULTS AND ANALYSIS

5.1 PIE CHART OF DATASET

5.2 HISTOGRAM OF HAM REVIEWS

5.3 HISTOGRAM OF SPAM REVIEWS

5.4 CONFUSION MATRIX

5.5 NAIVE BAYES PREDICTION

5.6 SVM PREDICTION

5.7 ACCURACY OF NAIVE BAYES

5.8 ACCURACY OF SVM

5.9 CLASSIFICATION REPORT OF NB

5.10 CLASSIFICATION REPORT OF SVM

5.11 USER INTERFACE

5.12 PREDICTION SPAM & HAM

6 CONCLUSION AND FUTURE ENHANCEMENTS

6.1 CONCLUSION

6.2 FUTURE ENHANCEMENTS

7 REFERENCES

LIST OF FIGURES

FIGURE NO FIGURE NAME PAGE NO

1 COMPARISON OF DIFFERENT MODELS
2 NAÏVE BAYES ACCURACY MODEL
3 SVM CLASSIFICATION REPORT
4 SYSTEM ARCHITECTURE
5 DATASET
6 CONFUSION MATRIX
7 PIE CHART
8 HISTOGRAMS
9 API
10 RESULT

ACKNOWLEDGEMENT
The satisfaction that accompanies the successful completion of a project would be
incomplete without mentioning the people who made it possible; their constant guidance
and encouragement crowned our efforts with success.

We would like to express our profound sense of gratitude to our project guide Mr. M.
SIVA PRATHAP REDDY, Associate Professor, Department of IT and AI&DS,
N.B.K.R.I.S.T (affiliated to JNTUA, Ananthapuramu), Vidyanagar, for his masterful
guidance and constant encouragement throughout the project. Our sincere thanks for
his suggestions and unmatched services, without which this work would have been an
unfulfilled dream.

We convey our special thanks to Dr. Y. Venkata Rami Reddy, respectable chairman of
N.B.K.R. Institute of Science and Technology, for providing excellent infrastructure in
our campus for the completion of the project.

We convey our special thanks to Sri N. Ram Kumar Reddy, respectable correspondent
of N.B.K.R. Institute of Science and Technology, for providing excellent
infrastructure in our campus for the completion of the project.

We are grateful to Dr. V. Vijaya Kumar Reddy, Director of N.B.K.R. Institute of
Science and Technology, for allowing us to avail all the facilities in the college.

We express our sincere gratitude to Dr. A. Narayana Rao, Professor and Head of
Department, IT and AI&DS, for providing exceptional facilities for the successful
completion of our project work.

We would like to convey our heartfelt thanks to the staff members, lab technicians, and
our friends, who extended their cooperation in making this project a successful
one.

We would like to thank one and all who have helped us directly and indirectly to
complete this project successfully.

ABSTRACT

E-commerce is the buying and selling of goods and services over the internet. Owing to
its convenience, the number of e-commerce users has increased, and so has the number of
product reviews. Fake reviews are often a major problem on e-commerce websites. Nowadays
it is common for users to write reviews of the products they have purchased.

Users can write reviews in many ways, and spammers can exploit this opportunity to leave
fake reviews. Many users judge the quality of a product by its reviews, so fake reviews
create many problems for product quality, sales, and economic growth. To tackle this
problem, we use a Naïve Bayes classifier and an SVM classifier, which are simple and easy
techniques for classifying product reviews. Feature extraction is used to extract features
from the reviews, and a dataset of product reviews is used for classification. Our aim is to
find fake reviews. By detecting fake reviews, the accuracy of the e-commerce system can be
improved.

1. INTRODUCTION

1.1 INTRODUCTION

The way people express their opinions and communicate with others on the web has
changed radically. Nowadays, customers rely heavily on written reviews before purchasing
a product. There are basically two types of spam reviews. The first type is
“positive/negative reviews” and the second type is “no reviews”. Positive/negative reviews
give an opinion on the product and influence product selection. A positive review creates a
good opinion of the product and raises its perceived quality, while a negative review creates
a bad impression and damages the reputation of the product. The second type, no
reviews (e.g. ads), carries no opinion on the product, yet it also encourages or
discourages customers from buying the product. The growth of e-commerce
sites has also increased the number of spam reviews; an estimated 20% of reviews on
e-commerce websites are fake. To detect fake reviews, we use feature
selection and extraction techniques together with classification methods. Well-known
classifiers for separating spam from non-spam data include the support vector machine (SVM),
decision tree, Naïve Bayes, and neural network. To process and analyze the dataset, we first
apply a feature extraction technique to the selected attributes. From a comparison of various
classifiers, we find that accuracy can be improved using the Naïve Bayes and SVM
classification techniques, which were found to perform better than the other classifiers.

1.2 MOTIVATION

The motivation for implementing this spam reviews detection project, “Fake reviews
detection based on index optimization”, is to help maintain trust, fairness and authenticity
in online reviews, which ultimately leads to better user experiences and brand reputation.
Reviews play a crucial role in shaping consumer decisions. Spam or fake reviews can
undermine trust in a product or service, leading to potential loss of customers and reputation
damage for businesses. Authenticity is key in online platforms. Spam reviews distort the
overall rating and mislead potential customers. Detecting and removing spam reviews ensures
a fair representation of products or services. Spam reviews clutter the review space and make
it difficult for users to find relevant information. Removing spam helps in reducing noise and
improving the overall quality of reviews. For businesses, their online reputation is invaluable.
Spam reviews can tarnish a brand's image. Detecting and removing spam helps protect the
brand's reputation and integrity.

1.3 PROBLEM STATEMENT

Given a dataset of user-generated reviews for products or services, the objective is to develop
a machine learning model that can accurately classify reviews as either spam or genuine.
Spam reviews are defined as those that are fake, deceptive, or manipulative in nature, aiming
to influence the perception of the product or service unfairly. The model should be able to
effectively distinguish between legitimate reviews and spam reviews, thus helping maintain
the integrity and trustworthiness of the review platform.

1.4 OBJECTIVES AND SCOPE

1. Developing a Robust Model: Create a machine learning model capable of accurately
distinguishing between spam and legitimate reviews. This involves exploring various
algorithms, feature engineering techniques, and data preprocessing methods to optimize
model performance.
2. Data Collection & Feature Engineering: Collect a dataset of product reviews, then
identify and extract relevant features from the review text, metadata (e.g., timestamp,
reviewer information), and other contextual information. These features should capture the
characteristics indicative of spam or genuine reviews.
3. Model Evaluation: Evaluate the performance of the developed model using appropriate
metrics such as accuracy, precision, recall, and F1-score. Conduct thorough testing on
validation and test datasets to ensure robustness and generalization.
4. Deployment & Integration: Integrate the trained model into the existing review platform
infrastructure, either as a standalone system or as part of a larger spam detection pipeline.
Ensure seamless integration with minimal disruption to user experience.
5. Monitoring & Improvement: Implement mechanisms for monitoring model performance
in real time and collecting feedback from users and moderators. Utilize this feedback to
iteratively improve the model's effectiveness and adapt to evolving spamming techniques.
SCOPE:
1. Review types: Focus on detecting spam reviews in text-based reviews for products,
services, or businesses across various domains (e.g., e-commerce, hospitality, restaurants).
2. Platforms: Determine the online platforms where spam reviews will be targeted for
detection. This could include e-commerce websites, review aggregation sites, social media
platforms, and any other platforms hosting user-generated reviews.
3. Spam Categories: Define the categories or characteristics of spam reviews that the
detection system will focus on. This may include fake reviews, biased reviews, reviews
posted by bots, reviews containing promotional content, or any other forms of deceptive or
manipulative content.
4. Detection Techniques: Specify the machine learning methods and algorithms to be
employed for spam detection.
5. Feature Selection: Identify the features and attributes of reviews that will be used for spam
detection. This may include textual features from the review text and metadata (e.g.,
reviewer information).
6. Data Sources: Utilize publicly available datasets, proprietary data from the review
platform (if available), and synthetic data generation techniques to augment the training
dataset.
7. Model Maintenance: Develop procedures for model maintenance, including version
control, retraining schedules, and performance monitoring. Establish protocols for handling
model drift and concept drift over time.

1.5 ORGANIZATION OF PROJECT REPORT

Organizing a project report effectively is crucial for presenting findings, insights, and
recommendations clearly and logically. The following structure is suggested for a project
report on spam reviews detection:

1. Title Page:
• Project title
2. Abstract:
• A brief summary of the project objectives, methods, key findings, and conclusions.
3. Table of Contents:
• List of sections and subsections with corresponding page numbers.
4. Introduction:
• Background and context of the problem (spam reviews detection).
• Motivation for the project.
• Objectives of the project.
• Overview of the report structure.
5. Literature Review:
• Review of existing research, methods, and technologies related to spam reviews
detection.
• Discussion of relevant theories, models, and approaches in the field.
6. Methodology:
• Description of the dataset used for training and testing.
• Explanation of data collection and preprocessing techniques.
• Overview of feature engineering methods.
• Explanation of the machine learning or detection algorithms employed.
• Details of model evaluation metrics and techniques.
7. Results:
• Presentation of experimental results, including model performance metrics.
• Visualization of key findings (e.g., confusion matrix, ROC curves).
• Discussion of any challenges encountered during experimentation.
8. Conclusion:
• Summary of the key findings and insights.
• Recapitulation of the project objectives and whether they were achieved.
• Implications of the findings for the field of spam reviews detection.
9. References:
• List of the sources cited in the report.

1.6 SUMMARY

2. LITERATURE SURVEY

A literature survey, also known as a literature review or literature search, is a critical
examination and synthesis of existing research and literature related to a particular topic or
subject area. It involves systematically gathering, evaluating, and analyzing published works
such as academic papers, books, articles, and reports that are relevant to the research question
or topic of interest.

2.1 EXISTING SYSTEM

The existing system is “Product spam reviews detection based on index
optimization”, which uses machine learning algorithms. The existing system uses a dataset
of 11 attributes, including Product name, Commodity property, Positive evaluation word,
Negative evaluation word, Positive effect word, Negative effect word, Review on the length
of the text, Number of votes for review content, and Review on the users credit experience.
Steps followed to implement:
1. Load the dataset.
2. Select the indexes using index optimization techniques.
3. Preprocess the selected data fields.
4. Split the dataset into train and test data.
5. Train the Naïve Bayes and SVM models on the training data.
6. Evaluate the models using factors such as F1-score, recall, and precision.
7. Test the model by giving input reviews.

Accuracy of model:

Fig:-1
DEMERITS:
• Low accuracy
• Similar data indexes
• Duplicate data

2.2 SURVEY

While working on this “Fake reviews detection based on index optimization” project, we went
through earlier conference papers and related research.

Year | Authors | Title | Description | Results | Drawbacks
-----|---------|-------|-------------|---------|----------
2019 | Ai-jun LI, Lei SHI | Product spam reviews detection based on index optimization | Provides an overview of spam reviews and detection methods | Detected spam reviews based on index optimization and ML algorithms | Accuracy of the model is low
2020 | N. Abdelmageed, H. Tork and Hussein | Survey of fake reviews | Provides an overview of various techniques and challenges in detecting fake reviews across different platforms | Summarizes their performance | May not be a generalized overview
2018 | Han yutan | False comment recognition | False comment recognition based on CNN | Improves online content evaluation | Low accuracy
2017 | S. Mukarjee, N. Glance | What yelp fake review filter might be doing | Analyzes the Yelp fake review filter and proposes a framework for understanding review spam | Offers insights into the challenges and effectiveness of existing spam filters | Limited to the Yelp platform; findings may not apply universally

2.3 PROPOSED SYSTEM

Based on the above research, we propose a new system called “Fake reviews detection based on
index optimization”, which uses machine learning algorithms and index optimization
techniques. The proposed system uses a different dataset from the existing model.
The dataset contains 6 attributes: category, review, rating, likes, username, and label.
Steps followed to implement:
1. Load the dataset.
2. Select the indexes using index optimization techniques.
3. Preprocess the selected data fields.
4. Split the dataset into train and test data.
5. Train the Naïve Bayes and SVM models on the training data.
6. Evaluate the models using factors such as F1-score, recall, and precision.
7. Create a user interface using Python Streamlit.
8. Test the model by giving input reviews.
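Step 4 of the list above can be sketched in plain Python as follows. This is only a minimal illustration under stated assumptions: the function name `split_train_test`, the 20% test ratio, and the seed are hypothetical choices, not values taken from the project.

```python
import random

def split_train_test(reviews, labels, test_ratio=0.2, seed=42):
    """Shuffle the data once and hold out `test_ratio` of it for testing.

    The ratio and seed are illustrative assumptions.
    """
    indices = list(range(len(reviews)))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    n_test = max(1, int(len(indices) * test_ratio))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [(reviews[i], labels[i]) for i in train_idx]
    test = [(reviews[i], labels[i]) for i in test_idx]
    return train, test

# Hypothetical toy data: ten reviews with alternating labels.
reviews = [f"review {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]
train, test = split_train_test(reviews, labels)
print(len(train), len(test))
```

The models are then fitted on `train` only, and `test` is kept aside for the evaluation in step 6.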
Compared with the existing model, the proposed model uses a different dataset, containing 8
fields, together with index optimization techniques, which gave optimized and accurate
results. The proposed model also improves the accuracy of the results and reduces
noise.
Accuracy of model:

Fig:-2 Naïve Bayes Model Classification Report

Fig:-3 SVM Model Classification Report

MERITS:
• High Accuracy
• Accurate Predictions

3. METHODOLOGY

3.1 INTRODUCTION
The methodology of spam reviews detection based on index optimization involves utilizing
various techniques to identify and filter out fraudulent or deceptive reviews from genuine
ones. This methodology typically involves the following steps:
1. Data collection: Data collection in spam review detection involves gathering a diverse set
of reviews from various sources, such as e-commerce websites, forums, or social media
platforms. These reviews can encompass different products, services, and domains to ensure
the model's robustness across various contexts.
Overall, effective data collection is foundational for training accurate and reliable spam
review detection models. It ensures that the model learns from a comprehensive and
representative sample of reviews, leading to better performance in identifying and filtering
out spam content.

2. Data preprocessing: Data preprocessing in spam review detection involves transforming
raw review data into a format that is suitable for analysis and model training. A typical
workflow for preprocessing data in this context is:
• Remove special (non-alphanumeric) characters such as punctuation and symbols
• Convert all text to lowercase
• Split each sentence into tokens for further analysis
• Eliminate common stop words (e.g., the, is, and)
• Stem words to their root form
3. Feature selection & extraction: Feature selection and extraction in spam review
detection involve identifying and extracting relevant information from the review data to
build effective models.
Features are selected using index optimization techniques such as the SelectKBest and
chi-square methods.
4. Model selection & training: Model selection and training in spam review detection
involve choosing appropriate machine learning algorithms and optimizing their parameters to
build an effective spam detection system.
Model selection means choosing machine learning models that are suitable for classification,
such as Naïve Bayes, SVM, logistic regression, decision trees, and random forest algorithms.
Model training involves several steps:
• Split the dataset into train and test sets.
• Preprocess the datasets.
• Train the model.
5. Model evaluation: Model evaluation in spam reviews detection involves assessing the
performance of the trained model to ensure its effectiveness in distinguishing between spam
and genuine reviews. In the formulas below, TP, TN, FP, and FN denote the numbers of true
positives, true negatives, false positives, and false negatives, with spam as the positive class.
Accuracy: Accuracy measures the overall correctness of the model and is calculated as (TP +
TN) / (TP + TN + FP + FN).
Precision: Precision measures the proportion of correctly classified spam reviews among all
reviews classified as spam and is calculated as TP / (TP + FP).
Recall: Recall measures the proportion of correctly classified spam reviews among all actual
spam reviews and is calculated as TP / (TP + FN).
F1-score: F1-score is the harmonic mean of precision and recall and is calculated as 2 *
(precision * recall) / (precision + recall). It provides a balanced measure of a model's
performance, especially when dealing with imbalanced datasets.
6. Testing model: Test the trained model by giving it an input review and predicting whether
the review is legitimate or spam.
7. Deployment: Deploy the model on any platform.
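The evaluation metrics from step 5 can be computed directly from the confusion-matrix counts. The sketch below is illustrative only: the function names and the toy labels are assumptions (1 = spam, 0 = genuine), and the project itself may rely on a library for these calculations.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN, treating `positive` (spam) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, tn, fp, fn

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 as defined in step 5."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical labels for eight reviews (1 = spam, 0 = genuine).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(classification_metrics(*confusion_counts(y_true, y_pred)))
```

The guards against empty denominators return 0.0, which is one common convention; other conventions exist.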

SYSTEM ARCHITECTURE

Data Collection

Data preprocessing & Extraction

Feature selection & Extraction

Model selection & Training

Model evaluation

Testing model

Deployment

Fig:-4 Architecture

4. IMPLEMENTATION

4.1 INTRODUCTION
Implementing a spam review detection system involves several steps, including data
collection, preprocessing, feature extraction, model training, and evaluation. A high-level
overview of each step follows:

Data Collection: Collect a reviews dataset from e-commerce websites or any other free
sources.

For this project, a dataset of product reviews was collected from the website linked below,
and some additional attributes were then added to it.

Dataset Source: https://pythongeeks.org/fake-product-review-detection-using-machine-learning/

Fig:-5 Dataset
Preprocess: Pre-processing cleans the data for analysis, for example by removing
duplicates, removing stop words, converting text to lowercase, tokenization, removing
punctuation, and stemming.
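A minimal sketch of these preprocessing steps, using only the Python standard library, might look like the following. The stop-word list and the suffix-stripping rule are simplified assumptions for illustration; they stand in for a full stop-word list and a real stemmer (such as the Porter stemmer), and all function names are hypothetical.

```python
import re

# Small illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "is", "and", "a", "an", "of", "to", "in", "it", "this"}

def simple_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(review):
    """Lowercase, strip punctuation, tokenize, drop stop words, stem."""
    text = review.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # remove punctuation and symbols
    tokens = text.split()                       # whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [simple_stem(t) for t in tokens]

def drop_duplicates(reviews):
    """Remove exact duplicate reviews while keeping the original order."""
    seen = set()
    unique = []
    for r in reviews:
        if r not in seen:
            seen.add(r)
            unique.append(r)
    return unique

print(preprocess("The Mobile condition is GOOD!!!"))
```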

Feature selection & extraction: Feature selection and extraction is the process of selecting
the indexes that give accurate results, based on weights or scores.

SelectKBest: A method for feature selection that selects the top k features with the highest
scores based on a specified criterion. The criterion could be a statistical measure like mutual
information, F-score, or chi-square.

Chi-square: A statistical test used to determine the association between categorical
variables. In the context of feature selection, chi-square can be used to assess the
independence between each feature and the target variable. Features with higher chi-square
values are considered more informative for predicting the target variable.
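The idea of scoring each term with chi-square and keeping the top k can be sketched in plain Python as below. This is a hedged illustration, not the project's code: it uses the classic 2x2 contingency-table statistic on term presence, which is not necessarily the exact score a library implementation of SelectKBest would report, and the function names and toy reviews are hypothetical.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 table.

    n11: spam reviews containing the term    n10: genuine reviews containing it
    n01: spam reviews without the term       n00: genuine reviews without it
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom

def select_k_best(docs, labels, k):
    """Rank terms by chi-square against the label (1 = spam) and keep the top k."""
    vocab = sorted({t for d in docs for t in d})
    scores = {}
    for term in vocab:
        n11 = sum(1 for d, y in zip(docs, labels) if term in d and y == 1)
        n10 = sum(1 for d, y in zip(docs, labels) if term in d and y == 0)
        n01 = sum(1 for d, y in zip(docs, labels) if term not in d and y == 1)
        n00 = sum(1 for d, y in zip(docs, labels) if term not in d and y == 0)
        scores[term] = chi_square(n11, n10, n01, n00)
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(vocab, key=lambda t: (-scores[t], t))
    return ranked[:k], scores

# Hypothetical tokenized reviews: two spam (1) and two genuine (0).
docs = [["buy", "now", "cheap"], ["cheap", "offer", "now"],
        ["good", "phone"], ["good", "battery", "phone"]]
labels = [1, 1, 0, 0]
selected, scores = select_k_best(docs, labels, k=4)
print(selected)
```

Terms that appear only in one class (such as "cheap" or "good" here) score highest, which matches the statement that higher chi-square values indicate more informative features.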

TF-IDF Score: TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical
statistic used to reflect the importance of a word in a document relative to a collection of
documents (corpus). It's commonly used in natural language processing and information
retrieval for tasks like text classification, clustering, and search engine ranking.

Here's a breakdown of how TF-IDF is calculated:

• Term Frequency (TF): It measures how frequently a term occurs in a document. It is
calculated as the ratio of the number of times a term appears in a document to the
total number of terms in the document. It helps to highlight the significance of terms
within individual documents.

TF(t, d) = no. of times term t appears in document d / total no. of terms in document d

• Inverse Document Frequency (IDF): It measures the importance of a term across the
entire corpus by penalizing terms that occur frequently across documents. It is
calculated as the logarithm of the ratio of the total number of documents to the
number of documents containing the term. Some implementations add one to the
denominator to avoid division by zero for terms that do not appear in the corpus.

IDF(t) = log(no. of documents in corpus / no. of documents containing the term t)

Numerical Example

Imagine the term t appears 20 times in a document that contains a total of 100 words.
The Term Frequency (TF) of t can be calculated as follows:

TF = 20/100 = 0.2

Assume a collection of related documents contains 10,000 documents. If 100 documents
out of the 10,000 contain the term t, the Inverse Document Frequency (IDF) of t can
be calculated as follows (using the base-10 logarithm):

IDF = log(10000/100) = 2

Using these two quantities, we can calculate the TF-IDF score of the term t for the
document:

TF-IDF = 0.2 * 2 = 0.4
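The arithmetic of this example can be checked with a few lines of Python. The function names are hypothetical, and the base-10 logarithm is used because it reproduces the value IDF = 2 from the example; library implementations often use the natural logarithm and additional smoothing, so their scores will differ.

```python
import math

def term_frequency(term_count, total_terms):
    """TF: occurrences of the term divided by the number of terms in the document."""
    return term_count / total_terms

def inverse_document_frequency(total_docs, docs_with_term):
    """IDF with a base-10 logarithm, matching the worked example."""
    return math.log10(total_docs / docs_with_term)

tf = term_frequency(20, 100)                    # term appears 20 times in 100 words
idf = inverse_document_frequency(10_000, 100)   # 100 of 10,000 documents contain it
tf_idf = tf * idf
print(tf, idf, tf_idf)
```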

TF-IDF assigns a score to every term in each sentence, so the reviews can be represented as
a matrix of term-weight vectors.

4.2 MODEL TRAINING:

Implemented Algorithms:

1. Naïve Bayes

The Naïve Bayes algorithm is a supervised learning algorithm that is based on Bayes'
theorem and used for solving classification problems. It is mainly used in text
classification, which involves high-dimensional training datasets. The Naïve Bayes classifier
is one of the simplest and most effective classification algorithms, and it helps in building
fast machine learning models that can make quick predictions. It is a probabilistic classifier,
which means it predicts on the basis of the probability of an object.

Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment
analysis, and classifying articles. The name Naïve Bayes is made up of two
words, Naïve and Bayes, which can be described as:

Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the
basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an
apple. Each feature individually contributes to identifying it as an apple, without depending
on the others.

Bayes: It is called Bayes because it depends on the principle of Bayes' theorem.

Bayes' Theorem:
Bayes' theorem, also known as Bayes' rule or Bayes' law, is used to determine the
probability of a hypothesis given prior knowledge. It depends on conditional probability.

The formula for Bayes' theorem is given as:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

Where,

P(A|B) is the posterior probability: the probability of hypothesis A given the observed event B.

P(B|A) is the likelihood: the probability of the evidence given that the
hypothesis is true.

P(A) is the prior probability: the probability of the hypothesis before observing the evidence.

P(B) is the marginal probability: the probability of the evidence.

Applying Bayes' theorem to our dataset:

Given a dataset with features X = (x1, x2, x3, …, xn) and a target variable y, where xi
represents the value of the ith feature and y represents the class label, the Naive Bayes
algorithm calculates the probability of each class given the features using Bayes' theorem
together with the assumption that the features are independent given the class:

$$P(y \mid x_1, x_2, \dots, x_n) = \frac{P(y)\, P(x_1 \mid y)\, P(x_2 \mid y) \cdots P(x_n \mid y)}{P(x_1, x_2, \dots, x_n)}$$

The denominator is the same for every class, so it can be ignored when comparing classes.

During training, the algorithm calculates the prior probability of each class P(y) and the
conditional probability of each feature given the class P(xi | y) based on the training data.

1. Prior probability: Calculate the prior probability of each class (positive and negative).

P(Positive) = no. of positive reviews / total no. of reviews

P(Negative) = no. of negative reviews / total no. of reviews

2. Conditional probability:

• Calculate the conditional probability of each word given the class.

For example: Good Mobile

• Calculate P(Good | Positive), P(Mobile | Positive), etc.

• These probabilities can be calculated using maximum likelihood estimation or
smoothing techniques like Laplace smoothing.

There are three types of Naive Bayes model, which are given below:

Gaussian: The Gaussian model assumes that features follow a normal distribution. This
means that if predictors take continuous values instead of discrete ones, the model assumes
that these values are sampled from a Gaussian distribution.

Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially
distributed. It is primarily used for document classification problems, that is, deciding which
category (such as sports, politics, or education) a particular document belongs to. The
classifier uses the frequency of words as the predictors.

Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the
predictor variables are independent Boolean variables, such as whether a particular word is
present in a document or not. This model is also well known for document classification tasks.

Creating Confusion Matrix:

Model Prediction:

New review: "Mobile condition is good"

Tokenization:

Tokenize the review into individual words: ["Mobile", "condition", "is", "good"].

Calculate probabilities:

For each class (positive and negative), calculate the probability of the review belonging to
that class using Bayes' theorem:

$$P(\text{Positive} \mid \text{review}) \propto P(\text{Positive}) \times P(\text{"Mobile"} \mid \text{Positive}) \times P(\text{"condition"} \mid \text{Positive}) \times \dots$$

$$P(\text{Negative} \mid \text{review}) \propto P(\text{Negative}) \times P(\text{"Mobile"} \mid \text{Negative}) \times P(\text{"condition"} \mid \text{Negative}) \times \dots$$

Note: The symbol ∝ means "proportional to."

Normalize probabilities: Normalize the probabilities so that they sum to 1.

Prediction: Select the class (positive or negative) with the highest probability as the
predicted sentiment for the review.
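The training and prediction steps described above can be sketched as a small multinomial Naïve Bayes classifier with Laplace smoothing. This is an illustrative sketch under stated assumptions, not the project's implementation: the four training reviews are invented, tokens are assumed to be lowercased already, words never seen in training are simply skipped, and all function names are hypothetical.

```python
import math
from collections import Counter

def train_naive_bayes(docs, labels, alpha=1.0):
    """Estimate class priors and Laplace-smoothed word likelihoods."""
    classes = sorted(set(labels))
    vocab = {t for d in docs for t in d}
    priors = {c: labels.count(c) / len(labels) for c in classes}
    word_counts = {c: Counter() for c in classes}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    likelihoods = {}
    for c in classes:
        total = sum(word_counts[c].values())
        likelihoods[c] = {
            w: (word_counts[c][w] + alpha) / (total + alpha * len(vocab))
            for w in vocab
        }
    return priors, likelihoods, vocab

def predict_proba(tokens, priors, likelihoods, vocab):
    """Return normalized class probabilities for a tokenized review."""
    log_scores = {}
    for c in priors:
        score = math.log(priors[c])            # work in log space to avoid underflow
        for t in tokens:
            if t in vocab:                     # unseen words are skipped (assumption)
                score += math.log(likelihoods[c][t])
        log_scores[c] = score
    top = max(log_scores.values())
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {c: v / z for c, v in unnormalized.items()}   # probabilities sum to 1

# Hypothetical training reviews.
docs = [["good", "mobile"], ["good", "condition"],
        ["bad", "mobile"], ["poor", "condition"]]
labels = ["positive", "positive", "negative", "negative"]
priors, likelihoods, vocab = train_naive_bayes(docs, labels)

probs = predict_proba(["mobile", "condition", "is", "good"], priors, likelihoods, vocab)
prediction = max(probs, key=probs.get)
print(probs, prediction)
```

With this toy data the positive class receives a probability of about 0.75, so the review is predicted as positive.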

2.Support Vector Machine

Support Vector Machine (SVM) is a powerful supervised learning algorithm used for
classification and regression tasks. In the context of text classification, SVM is often used
for tasks like spam detection. SVM works by finding the optimal hyperplane that separates
data points of different classes with the largest possible margin. Here's an in-depth
explanation of how SVM works:

Linear SVM: Consider a binary classification problem where we have two classes: positive
(+1) and negative (-1). SVM aims to find a hyperplane that best separates the data points of
these two classes in feature space.

Margin: The margin is the distance between the hyperplane and the nearest data point from
either class. SVM aims to maximize this margin because a larger margin typically results in
better generalization to unseen data.

Hyperplane: In a two-dimensional feature space, a hyperplane is a line. In higher
dimensions, it becomes a hyperplane. The equation of a hyperplane in a d-dimensional space
is given by:

$$w^T x + b = 0$$

Where,

w = weight vector

x = input feature vector

b = bias term

Objective function: The objective of SVM is to find the hyperplane that maximizes the
margin while minimizing the classification error. This can be formulated as an optimization
problem.

Minimize $\frac{1}{2} \lVert w \rVert^2$

Subject to:

$y_i (w^T x_i + b) \ge 1$ for all training examples $(x_i, y_i)$

Where,

$(x_i, y_i)$ are the training examples.

$y_i$ is the class label (+1 or -1).

Let's consider a simple example of spam reviews detection using SVM. Suppose we have a
dataset of reviews labeled as spam or non-spam. Each review is represented as a feature
vector x (e.g., a TF-IDF vector) and belongs to a class y (spam or non-spam).

Training:

• We feed the labeled training data (feature vectors and corresponding labels) into the
SVM algorithm.

• SVM learns the optimal hyperplane that separates spam reviews from non-spam
reviews with the largest margin.

30
Dept. of IT and AI&DS,NBKRIST
Prediction:

• Given a new review, we represent it as a feature vector $x_{new}$.

• We use the learned SVM model to predict the class label for the new review.

• If the value of $w^T x_{new} + b$ is positive, the review is classified as spam; otherwise, it is
classified as non-spam.
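
The training and prediction steps above can be sketched with scikit-learn as follows. This is an illustrative sketch, not the project's actual code: the six reviews are hand-written stand-ins for the real dataset, and the choice of `TfidfVectorizer` with `LinearSVC` in a pipeline is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny hand-written corpus for illustration only (not the project dataset).
reviews = [
    "Buy now cheap discount click this link",
    "Best price guaranteed click here to win",
    "Win a free gift card visit this link now",
    "The battery lasts two days and the screen is sharp",
    "Delivery was late but the shoes fit well",
    "Solid build quality, works as described",
]
labels = [1, 1, 1, -1, -1, -1]   # +1 = spam, -1 = non-spam

# TF-IDF turns each review into a feature vector x; LinearSVC learns w and b.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(reviews, labels)

new_reviews = [
    "Click this link to win a discount",
    "The screen is sharp and works well",
]
scores = model.decision_function(new_reviews)   # value of w^T x_new + b
predictions = model.predict(new_reviews)        # +1 if the score is positive, else -1

for text, score, label in zip(new_reviews, scores, predictions):
    print(f"{score:+.3f}  {'spam' if label == 1 else 'non-spam'}  {text}")
```

Wrapping the vectorizer and the classifier in one pipeline ensures that new reviews are transformed with the vocabulary learned during training.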

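Mathematical Formulas:

Hyperplane equation: $w^T x + b = 0$

Objective function: Minimize $\frac{1}{2}\|w\|^2$

subject to:

$y_i(w^T x_i + b) \geq 1$ for $i = 1, \dots, N$

Lagrangian function: $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i(w^T x_i + b) - 1 \right]$

Dual formulation: Maximize $\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j x_i^T x_j$

Subject to:

$\sum_{i=1}^{N} \alpha_i y_i = 0$

$\alpha_i \geq 0$ for all $i$

Kernel function: $K(x_i, x_j)$, which replaces the inner product $x_i^T x_j$ in the dual
formulation when the classes are not linearly separable in the original feature space.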
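
To connect the dual formulation with the hyperplane, the sketch below fits scikit-learn's `SVC` with a linear kernel on toy data and rebuilds the weight vector from the dual coefficients using $w = \sum_i \alpha_i y_i x_i$. The data and the large value of `C` (used here to approximate the hard-margin case) are assumptions made for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data, made up for illustration.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [0.0, 0.0], [-1.0, 0.0], [0.0, -1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# SVC solves the dual problem; a large C approximates the hard-margin case.
clf = SVC(kernel="linear", C=1e3).fit(X, y)

# dual_coef_ stores alpha_i * y_i for the support vectors only.
alpha_times_y = clf.dual_coef_[0]
support_vectors = clf.support_vectors_

# Recover the primal weight vector: w = sum_i alpha_i y_i x_i.
w = alpha_times_y @ support_vectors
b = clf.intercept_[0]

print("w =", w, "b =", b)
print("sum of alpha_i * y_i =", alpha_times_y.sum())   # close to 0 (dual constraint)
```

With a non-linear kernel such as `kernel="rbf"`, the inner product is replaced by $K(x_i, x_j)$ and an explicit weight vector is no longer available, so predictions are computed from the support vectors directly.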

Confusion Matrix:

Fig:-6 SVM Decision line

4.3 CREATE A UI
4.3.1 Streamlit:

Streamlit is an open-source Python library that allows you to create interactive web
applications directly from Python scripts. It's designed to make it easy and fast for data
scientists and machine learning engineers to build and share data-driven applications without
needing to have web development skills.

Here's how Streamlit works:

Write Python Script : With Streamlit, you write your application logic in Python scripts
using simple and intuitive syntax. You can use familiar Python libraries like NumPy, Pandas,
and Matplotlib for data processing, visualization, and machine learning tasks.

Declarative Programming : Streamlit uses a declarative programming model, which means
that you specify what you want to appear on the web page, and Streamlit takes care of the
underlying HTML, CSS, and JavaScript code to render the user interface.

Real-time Updates : Streamlit detects changes to the saved Python script and can rerun the
application so that the web page reflects them. This allows you to see changes as you make
them, without needing to restart the server by hand.

Widgets and Components : Streamlit provides a wide range of widgets and components
that you can use to build interactive elements in your application, such as sliders,
dropdowns, buttons, and text inputs. These widgets allow users to interact with your data and
control the behavior of your application.

Installation of Streamlit :

Install Streamlit with the pip command shown below, then import it in the Python script:

pip install streamlit

import streamlit as st

4.4 REQUIREMENTS

4.4.1 Hardware Requirements :

H/W Configuration:

• Processor - Intel i3 processor

• Hard Disk - 160GB
• RAM - 8GB

S/W Requirements:

• Operating System : Windows 7/8/10

• User interface design : Streamlit
• IDE : PyCharm, Jupyter
• Libraries used : NumPy, pandas, sklearn, seaborn, matplotlib, nltk, pickle
• Web browser : Microsoft Edge
• Technology : Python

5. RESULTS AND ANALYSIS

5.1 PIECHART OF DATASET

Fig:-7 Pie Chart

5.2 HISTOGRAM OF HAM REVIEWS

Fig:-8 Histogram of Ham Reviews
5.3 HISTOGRAM OF SPAM REVIEWS

Fig:-8 Histogram of Spam Reviews

5.4 CONFUSION MATRIX

5.4.1 Naïve bayes model :

Fig:-9 Confusion Matrix

5.4.2 SVM Model :

Fig:-9 Confusion Matrix

5.5 NAÏVE BAYES PREDICTION

5.6 SVM PREDICTION

5.7 ACCURACY OF NAÏVE BAYES

5.8 ACCURACY OF SVM

5.9 CLASSIFICATION REPORT OF NB

5.10 CLASSIFICATION REPORT OF SVM

5.11 USER INTERFACE

Fig:-10 API

5.12 PREDICTION SPAM/HAM

Fig:-11 Result

6. CONCLUSION AND FUTURE ENHANCEMENTS

6.1 CONCLUSION:

In conclusion, spam review detection is a crucial task in various domains, including
e-commerce, hospitality, and online platforms, where the authenticity and credibility of
user-generated content are paramount. Using supervised machine learning algorithms such as
Support Vector Machines (SVM) and Naïve Bayes, together with feature engineering
strategies, effective spam review detection systems can be developed.

The implementation of SVM models allows for the classification of reviews into spam or
non-spam categories based on their textual content, leveraging features such as TF-IDF
representations. These models are trained on labeled datasets, enabling them to learn patterns
and characteristics indicative of spam reviews, such as excessive promotional language,
irrelevant content, or deceptive practices.

Spam review detection systems are a critical component in maintaining the integrity
and trustworthiness of online platforms and services. By combining machine learning
techniques with feature engineering, these systems can identify and filter out spam content,
thereby enhancing user experience, trust, and credibility in online communities.

6.2 FUTURE ENHANCEMENTS

The proposed project can be further developed for multilevel classification by using
machine learning and deep learning algorithms such as decision trees, random forests, and
convolutional neural networks.

7. REFERENCES

1. Ai-jun Li, Lei Shi. Product spam reviews detection based on index optimization.
Provides an overview of spam reviews and detection methods.
2. N. Abdelmageed, H. Tork, Hussein. Survey of fake reviews, 2020. Provides an
overview of various techniques and challenges in detecting fake reviews across different
platforms.
3. Han Yutan. False comment recognition, 2018. False comment recognition based on
CNN.
4. Li Xiao, Ding Shengchun. Research on the identification of spam comment information,
2013.
5. You Guirong, Wu Wei, Qian Yuntao. Feature extraction method of spam review detection
in e-commerce, 2014.
