Main

Uploaded by

raviyallampati

FAKE REVIEWS DETECTION BASED ON INDEX OPTIMIZATION

A Main Project Report submitted to


JNTUA, Ananthapuramu

In partial fulfillment of the requirements for the award of the degree of


Bachelor of Technology

(INFORMATION TECHNOLOGY)
By
A. SANTHOSH (20KB1A1201) K. VENKATESWARLU (20KB1A1230)
V. NITHIN (20KB1A1259) Y. RAVI (20KB1A1261)
N. JAGADEESH (21KB5A1205)

Under the esteemed guidance of

Mr. M. SIVAPRATAP REDDY


Assistant Professor, Dept. of IT and AI&DS

DEPARTMENT OF IT and AI&DS


N.B.K.R INSTITUTE OF SCIENCE & TECHNOLOGY
(Autonomous)
VIDYANAGAR – 524 413, TIRUPATI DIST, AP
MAY 2024

1
Dept. of IT and AI&DS,NBKRIST
Website: www.nbkrist.org Email: ist@nbkrist.org
Ph: 08624-228 247 Fax: 08624-228 257

N.B.K.R. INSTITUTE OF SCIENCE & TECHNOLOGY


(Autonomous)
(Approved by AICTE: Accredited by NBA: Affiliated to JNTUA, Ananthapuramu)

An ISO 9001-2000 Certified Institution

Vidyanagar -524 413, Tirupati District, Andhra Pradesh, India

BONAFIDE CERTIFICATE
This is to certify that the project work entitled “FAKE REVIEWS DETECTION BASED
ON INDEX OPTIMIZATION” is a bonafide work done by A. Santhosh (20KB1A1201),
K. Venkateswarlu (20KB1A1230), V. Nithin (20KB1A1259), Y. Ravi (20KB1A1261), and N.
Jagadeesh (21KB5A1205) in the Department of IT and AI&DS, N.B.K.R. Institute of
Science & Technology, Vidyanagar, and is submitted to JNTUA, Ananthapuramu in
partial fulfillment of the requirements for the award of the B.Tech degree in Information
Technology. This work has been carried out under my supervision.

Mr. M. SIVA PRATHAP REDDY Dr. A. NARAYANA RAO


Assistant Professor Professor & Head
Department of IT and AI&DS Department of IT and AI&DS
N.B.K.R.I.S.T, Vidyanagar N.B.K.R.I.S.T, Vidyanagar

Submitted for the Viva-Voce Examination held on ____________

Internal Examiner External Examiner

TABLE OF CONTENTS

CHAPTER NO CHAPTER NAME PAGE NO

BONAFIDE CERTIFICATE

LIST OF FIGURES

LIST OF TABLES

ACKNOWLEDGEMENT

ABSTRACT

1 INTRODUCTION

1.1 INTRODUCTION

1.2 MOTIVATION

1.3 PROBLEM STATEMENT

1.4 OBJECTIVES AND SCOPE

1.5 ORGANIZATION OF THE PROJECT

1.6 SUMMARY

2 LITERATURE SURVEY

2.1 EXISTING SYSTEM

2.2 SURVEY

2.3 PROPOSED SYSTEM


3 METHODOLOGY

3.1 INTRODUCTION

4 IMPLEMENTATION

4.1 INTRODUCTION

4.2 MODEL TRAINING

4.3 CREATE A UI

4.4 REQUIREMENTS

5 RESULTS AND ANALYSIS

5.1 PIE CHART OF DATASET

5.2 HISTOGRAM OF HAM REVIEWS

5.3 HISTOGRAM OF SPAM REVIEWS

5.4 CONFUSION MATRIX

5.5 NAIVE BAYES PREDICTION

5.6 SVM PREDICTION

5.7 ACCURACY OF NAIVE BAYES

5.8 ACCURACY OF SVM

5.9 CLASSIFICATION REPORT OF NB

5.10 CLASSIFICATION REPORT OF SVM

5.11 USER INTERFACE

5.12 PREDICTION SPAM & HAM

6 CONCLUSION AND FUTURE ENHANCEMENTS

6.1 CONCLUSION

6.2 FUTURE ENHANCEMENTS

7 REFERENCES

LIST OF FIGURES

FIGURE NO FIGURE NAME PAGE NO

1 COMPARISON OF DIFFERENT MODELS
2 NAÏVE BAYES ACCURACY MODEL
3 SVM CLASSIFICATION REPORT
4 SYSTEM ARCHITECTURE
5 DATASET
6 CONFUSION MATRIX
7 PIE CHART
8 HISTOGRAMS
9 API
10 RESULT

ACKNOWLEDGEMENT
The satisfaction that accompanies the successful completion of a project would be
incomplete without mentioning the people who made it possible, and whose constant guidance
and encouragement crowned our efforts with success.

We would like to express our profound sense of gratitude to our project guide, Mr. M.
SIVA PRATHAP REDDY, Associate Professor, Department of IT and AI&DS,
N.B.K.R.I.S.T (affiliated to JNTUA, Ananthapuramu), Vidyanagar, for his masterful
guidance and constant encouragement throughout the project. Our sincere thanks to
him for his suggestions and unmatched services, without which this work would have been an
unfulfilled dream.

We convey our special thanks to Dr. Y. Venkata Rami Reddy, respected Chairman of
N.B.K.R. Institute of Science and Technology, for providing excellent infrastructure on
our campus for the completion of the project.

We convey our special thanks to Sri N. Ram Kumar Reddy, respected Correspondent
of N.B.K.R. Institute of Science and Technology, for providing excellent
infrastructure on our campus for the completion of the project.

We are grateful to Dr. V. Vijaya Kumar Reddy, Director of N.B.K.R. Institute of
Science and Technology, for allowing us to use all the facilities in the college.

We express our sincere gratitude to Dr. A. Narayana Rao, Professor and Head of the
Department, IT and AI&DS, for providing exceptional facilities for the successful
completion of our project work.

We would like to convey our heartfelt thanks to the staff members, lab technicians, and
our friends, who extended their cooperation in making this project a successful
one.

We would like to thank one and all who have helped us, directly and indirectly, to
complete this project successfully.

ABSTRACT

E-commerce is the buying and selling of goods and services over the internet. Owing to the
convenience of e-commerce, the number of users has increased, and so has the number of
product reviews. Fake reviews are a major problem on e-commerce websites. It is now common
for users to write reviews of the products they have purchased.

Users can write reviews in many ways, and spammers can exploit this opportunity to leave
fake reviews. Many users judge the quality of a product from other users' reviews, so fake
reviews create problems for perceived product quality, sales, and economic growth. To tackle
this problem, we use a Naive Bayes classifier and an SVM classifier, which are simple
techniques for classifying product reviews. Feature extraction is used to extract features
from the reviews, and a dataset of product reviews is used for classification. Our aim is to
find fake reviews. By detecting fake reviews, the trustworthiness of e-commerce systems can
be improved.

1. INTRODUCTION

1.1 INTRODUCTION

The way people express their opinions and communicate with others on the web has changed
radically. Nowadays, customers rely heavily on written reviews before purchasing a product.
There are basically two types of spam reviews: “positive/negative reviews” and “no reviews”.
Positive/negative reviews give an opinion that influences product selection: a positive
review creates a good impression of the product and raises its perceived quality, while a
negative review creates a bad impression and damages the product's reputation. The second
type, “no reviews” (e.g., ads), carries no opinion on the product, but it still encourages
or discourages customers from buying it. The growth of e-commerce sites has also increased
the number of spam reviews; 20% of reviews on e-commerce websites were reported to be fake.
To detect fake reviews, we use feature selection and extraction techniques together with
classification methods. Several well-known classifiers can separate spam from non-spam data,
such as support vector machines, decision trees, Naïve Bayes, and neural networks. To process
and analyze the dataset, we first apply a feature extraction technique to the selected
attributes. From our study of various classifiers, we find that accuracy can be improved using
the Naïve Bayes and SVM classification techniques, which were found to perform better than
the other classifiers.

1.2 MOTIVATION

The motivation for implementing this spam review detection project, “Fake reviews
detection based on index optimization”, is to help maintain trust, fairness, and authenticity
in online reviews, which ultimately leads to better user experiences and brand reputation.
Reviews play a crucial role in shaping consumer decisions. Spam or fake reviews can
undermine trust in a product or service, leading to potential loss of customers and
reputational damage for businesses. Authenticity is key on online platforms. Spam reviews
distort the overall rating and mislead potential customers. Detecting and removing spam
reviews ensures a fair representation of products or services. Spam reviews clutter the
review space and make it difficult for users to find relevant information. Removing spam
helps reduce noise and improves the overall quality of reviews. For businesses, online
reputation is invaluable. Spam reviews can tarnish a brand's image. Detecting and removing
spam helps protect the brand's reputation and integrity.

1.3 PROBLEM STATEMENT

Given a dataset of user-generated reviews for products or services, the objective is to develop
a machine learning model that can accurately classify reviews as either spam or genuine.
Spam reviews are defined as those that are fake, deceptive, or manipulative in nature, aiming
to influence the perception of the product or service unfairly. The model should be able to
effectively distinguish between legitimate reviews and spam reviews, thus helping maintain
the integrity and trustworthiness of the review platform.

1.4 OBJECTIVES AND SCOPE

1. Developing a Robust Model: Create a machine learning model capable of accurately
distinguishing between spam and legitimate reviews. This involves exploring various
algorithms, feature engineering techniques, and data preprocessing methods to optimize
model performance.
2. Data Collection & Feature Engineering: Collect a dataset of product reviews, then
identify and extract relevant features from the review text, metadata (e.g., timestamp,
reviewer information), and other contextual information. These features should capture the
characteristics indicative of spam or genuine reviews.
3. Model Evaluation: Evaluate the performance of the developed model using appropriate
metrics such as accuracy, precision, recall, and F1-score. Conduct thorough testing on
validation and test datasets to ensure robustness and generalization.
4. Deployment & Integration: Integrate the trained model into the existing review platform
infrastructure, either as a standalone system or as part of a larger spam detection pipeline.
Ensure seamless integration with minimal disruption to user experience.
5. Monitoring & Improvement: Implement mechanisms for monitoring model performance
in real time and collecting feedback from users and moderators. Use this feedback to
iteratively improve the model's effectiveness and adapt to evolving spamming techniques.
SCOPE:
1. Review types: Focus on detecting spam in text-based reviews for products,
services, or businesses across various domains (e.g., e-commerce, hospitality, restaurants).
2. Platforms: Determine the online platforms on which spam reviews will be targeted for
detection. This could include e-commerce websites, review aggregation sites, social media
platforms, and any other platforms hosting user-generated reviews.
3. Spam Categories: Define the categories or characteristics of spam reviews that the
detection system will focus on. This may include fake reviews, biased reviews, reviews
posted by bots, reviews containing promotional content, or any other forms of deceptive or
manipulative content.
4. Detection Techniques: Specify the machine learning methods and algorithms to be
employed for spam detection.
5. Feature Selection: Identify the features and attributes of reviews that will be used for
spam detection. This may include textual features, metadata (e.g., reviewer information),
and the results of text analysis.
6. Data Sources: Utilize publicly available datasets, proprietary data from the review
platform (if available), and synthetic data generation techniques to augment the training
dataset.
7. Model Maintenance: Develop procedures for model maintenance, including version
control, retraining schedules, and performance monitoring. Establish protocols for handling
model drift and concept drift over time.

1.5 ORGANIZATION OF PROJECT REPORT

Organizing a project report effectively is crucial for presenting findings, insights, and
recommendations clearly and logically. A suggested structure for a project report on spam
review detection is:

1. Title Page:
• Project title
2. Abstract:
• A brief summary of the project objectives, methods, key findings, and conclusions.
3. Table of Contents:
• List of sections and subsections with corresponding page numbers.
4. Introduction:
• Background and context of the problem (spam review detection).
• Motivation for the project.
• Objectives of the project.
• Overview of the report structure.
5. Literature Review:
• Review of existing research, methods, and technologies related to spam review
detection.
• Discussion of relevant theories, models, and approaches in the field.
6. Methodology:
• Description of the dataset used for training and testing.
• Explanation of data collection and preprocessing techniques.
• Overview of feature engineering methods.
• Explanation of the machine learning or detection algorithms employed.
• Details of model evaluation metrics and techniques.
7. Results:
• Presentation of experimental results, including model performance metrics.
• Visualization of key findings (e.g., confusion matrix, ROC curves).
• Discussion of any challenges encountered during experimentation.
8. Conclusion:
• Summary of the key findings and insights.
• Recapitulation of the project objectives and whether they were achieved.
• Implications of the findings for the field of spam review detection.
9. References:
• List of the sources cited in the report.

1.6 SUMMARY

This chapter introduced the problem of spam reviews on e-commerce platforms. Customers rely
heavily on written reviews before purchasing a product, and the growth of e-commerce sites
has increased the number of spam reviews, both opinionated “positive/negative reviews” and
non-opinionated “no reviews” such as ads. The chapter described the motivation for detecting
such reviews, stated the problem as classifying reviews as spam or genuine, and set out the
objectives and scope of the project. To detect fake reviews, we use feature selection and
extraction techniques together with the Naïve Bayes and SVM classifiers, which were found to
perform better than the other classifiers considered.

2. LITERATURE SURVEY

A literature survey, also known as a literature review or literature search, is a critical
examination and synthesis of existing research and literature related to a particular topic or
subject area. It involves systematically gathering, evaluating, and analyzing published works
such as academic papers, books, articles, and reports that are relevant to the research question
or topic of interest.

2.1 EXISTING SYSTEM

The existing system, “Product spam reviews detection based on index optimization”, uses
machine learning algorithms. It uses a dataset consisting of 11 attributes, including
Product name, Commodity property, Positive evaluation word, Negative evaluation word,
Positive effect word, Negative effect word, Length of the review text, Number of votes for
the review content, and the reviewing user's credit experience.
Steps followed to implement:
1. Load the dataset.
2. Select the indexes using index optimization techniques.
3. Preprocess the selected data fields.
4. Split the dataset into training and test data.
5. Train the Naïve Bayes and SVM models on the training data and evaluate them on the test data.
6. Evaluate the models using factors such as F1 score, recall, and precision.
7. Test the model by giving input reviews.

Accuracy of model:

Fig:-1 Comparison of different models
DEMERITS:
• Low accuracy
• Similar data indexes
• Duplicated data

2.2 SURVEY

While working on this “Fake reviews detection based on index optimization” project, we went
through earlier conference papers and surveyed the related research.

Year | Authors | Title | Description | Results | Drawbacks

2019 | Ai-jun Li, Lei Shi | Product spam reviews detection based on index optimization | Provides an overview of spam reviews and detection methods | Detected spam reviews based on index optimization and ML algorithms | Accuracy of the model is low

2020 | N. Abdelmageed, H. Tork and Hussein | Survey of fake reviews | Provides an overview of various techniques and challenges in detecting fake reviews across different platforms | Summarizes their performance | May not give a generalized overview

2018 | Han Yutan | False comment recognition | False comment recognition based on CNN | Improves online content evaluation | Low accuracy

2017 | S. Mukarjee, N. Glance | What Yelp fake review filter might be doing | Analyzes the Yelp fake review filter and proposes a framework for understanding review spam | Offers insights into the challenges and effectiveness of existing spam filters | Limited to the Yelp platform; findings may not apply universally

2.3 PROPOSED SYSTEM

Based on the above research, we propose a new system called “Fake reviews detection based on
index optimization”, which uses machine learning algorithms and index optimization
techniques. In the proposed system we use a different dataset from the existing model.
The dataset contains 6 attributes: category, review, rating, likes, username, and label.
Steps followed to implement:
1. Load the dataset.
2. Select the indexes using index optimization techniques.
3. Preprocess the selected data fields.
4. Split the dataset into training and test data.
5. Train the Naïve Bayes and SVM models on the training data and evaluate them on the test data.
6. Evaluate the models using factors such as F1 score, recall, and precision.
7. Create a user interface using Python Streamlit.
8. Test the model by giving input reviews.
Compared with the existing model, the proposed model uses a different dataset containing 8
fields, from which fields are selected using index optimization techniques; this gave more
accurate results. The proposed model also improves the efficiency of prediction and reduces
noise.
Accuracy of model:

Fig:-2 Naïve Bayes Model Classification Report

Fig:-3 SVM Model Classification Report

MERITS:
• High accuracy
• Accurate predictions

3. METHODOLOGY

3.1 INTRODUCTION
The methodology of spam review detection based on index optimization uses
various techniques to identify fraudulent or deceptive reviews and filter them out from
genuine ones. The methodology involves the following steps:
1. Data collection: Data collection in spam review detection involves gathering a diverse set
of reviews from various sources, such as e-commerce websites, forums, or social media
platforms. These reviews can cover different products, services, and domains to ensure
the model's robustness across various contexts.
Overall, effective data collection is foundational for training accurate and reliable spam
review detection models. It ensures that the model learns from a comprehensive and
representative sample of reviews, leading to better performance in identifying and filtering
out spam content.
2. Data preprocessing: Data preprocessing in spam review detection involves transforming
raw review data into a format that is suitable for analysis and model training. A typical
workflow for preprocessing data in this context is:
• Remove special characters (non-alphanumeric characters such as punctuation and symbols).
• Convert all text to lowercase.
• Split each sentence into tokens for further analysis.
• Eliminate common stop words (e.g., the, is, and).
• Stem words to their root form.
3. Feature selection & extraction: Feature selection and extraction in spam review
detection involve identifying and extracting relevant information from the review data to
build effective models.
Features are selected using index optimization techniques such as the SelectKBest and
chi-square methods.
4. Model selection & training: Model selection and training in spam review detection
involve choosing appropriate machine learning algorithms and optimizing their parameters to
build an effective spam detection system.
Model selection means choosing machine learning models that are suitable for classification,
such as Naïve Bayes, SVM, logistic regression, decision trees, and random forest algorithms.
Model training involves several steps:
• Split the dataset into training and test sets.
• Preprocess the datasets.
• Train the model.
5. Model evaluation: Model evaluation in spam review detection involves assessing the
performance of the trained model to ensure it is effective at distinguishing between spam
and genuine reviews. In the formulas below, TP, TN, FP, and FN are the numbers of true
positives, true negatives, false positives, and false negatives, with spam as the positive class.
Accuracy: Accuracy measures the overall correctness of the model and is calculated as (TP +
TN) / (TP + TN + FP + FN).
Precision: Precision measures the proportion of correctly classified spam reviews among all
reviews classified as spam and is calculated as TP / (TP + FP).
Recall: Recall measures the proportion of correctly classified spam reviews among all actual
spam reviews and is calculated as TP / (TP + FN).
F1-score: F1-score is the harmonic mean of precision and recall and is calculated as 2 *
(precision * recall) / (precision + recall). It provides a balanced measure of a model's
performance, especially when dealing with imbalanced datasets.
6. Testing model: Test the trained model by giving it an input review and predicting whether
the review is legitimate or spam.
7. Deployment: Deploy the model on the target platform.
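
The evaluation metrics in step 5 can be computed directly from confusion-matrix counts. The following is a minimal sketch, not the project's own code: the function names `confusion_counts` and `classification_metrics`, the label strings, and the sample labels are all assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Count TP, TN, FP, FN, treating `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical true and predicted labels for six reviews.
y_true = ["spam", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "ham", "spam", "spam", "ham"]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
metrics = classification_metrics(tp, tn, fp, fn)
```

The guards against zero denominators are a design choice for this sketch; they return 0.0 when a metric is undefined (for example, precision when nothing is predicted as spam).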

SYSTEM ARCHITECTURE

Data Collection

Data preprocessing & Extraction

Feature selection & Extraction

Model selection & Training

Model evaluation

Testing model

Deployment

Fig:-4 Architecture

4. IMPLEMENTATION

4.1 INTRODUCTION
Implementing a spam review detection system involves several steps, including data
collection, preprocessing, feature extraction, model training, and evaluation. A high-level
overview of each step follows.

Data Collection: Collect a reviews dataset from e-commerce websites or any other free
sources.

For this project, we collected a product reviews dataset from the website linked below and
then added some additional attributes to it.

Dataset Source: https://pythongeeks.org/fake-product-review-detection-using-machine-learning/

Fig:-5 Dataset
Preprocess: Preprocessing cleans the data for analysis: removing duplicates, removing stop
words, converting text to lowercase, tokenization, removing punctuation, and stemming.
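
The preprocessing steps listed above can be sketched with the Python standard library alone. This is an illustrative sketch under stated assumptions: the stop-word list is deliberately tiny, `simple_stem` is a crude suffix stripper standing in for a real stemmer (such as the Porter stemmer), and all function names are hypothetical rather than taken from the project.

```python
import re

# Small illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = {"the", "is", "and", "a", "an", "this", "it", "to", "of"}


def simple_stem(word):
    """Crude suffix stripping, standing in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "ly", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(review):
    """Lowercase, strip punctuation/symbols, tokenize, drop stop words, stem."""
    text = review.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # keep letters, digits, whitespace
    tokens = text.split()                      # whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [simple_stem(t) for t in tokens]


def drop_duplicates(reviews):
    """Remove exact duplicate reviews while preserving their order."""
    return list(dict.fromkeys(reviews))


tokens = preprocess("This phone is AMAZING, and the battery lasted 2 days!!!")
# tokens == ['phone', 'amaz', 'battery', 'last', '2', 'day']
```

The truncated stem `amaz` shows why stemmed tokens are meant as features for a model rather than as readable words.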

Feature selection & extraction: Feature selection and extraction is the process of selecting
the indexes (features) that give the most accurate results, based on weights or scores.

SelectKBest: A feature selection method that selects the top k features with the highest
scores according to a specified criterion. The criterion can be a statistical measure such as
mutual information, F-score, or chi-square.

Chi-square: A statistical test used to determine the association between categorical
variables. In the context of feature selection, chi-square can be used to assess the
independence between each feature and the target variable. Features with higher chi-square
values are considered more informative for predicting the target variable.

TF-IDF Score: TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical
statistic used to reflect the importance of a word in a document relative to a collection of
documents (corpus). It is commonly used in natural language processing and information
retrieval for tasks like text classification, clustering, and search engine ranking.

Here is a breakdown of how TF-IDF is calculated:

• Term Frequency (TF): It measures how frequently a term occurs in a document. It is
calculated as the ratio of the number of times a term appears in a document to the
total number of terms in the document. It helps to highlight the significance of terms
within individual documents.

TF(t, d) = no. of times term t appears in document d / total no. of terms in document d

• Inverse Document Frequency (IDF): It measures the importance of a term across the
entire corpus by penalizing terms that occur frequently across documents. It is
calculated as the logarithm of the ratio of the total number of documents to the
number of documents containing the term. Some implementations add one to the
denominator to avoid division by zero for terms that do not appear in the corpus; the
formula used here omits that adjustment.

IDF(t) = log(no. of documents in corpus / no. of documents containing the term t)

Numerical Example

Imagine the term t appears 20 times in a document that contains a total of 100 words.
The Term Frequency (TF) of t can be calculated as follows:

TF = 20/100 = 0.2

Assume the corpus contains 10,000 documents. If 100 of the 10,000 documents contain
the term t, the Inverse Document Frequency (IDF) of t, using a base-10 logarithm, can
be calculated as follows:

IDF = log(10000/100) = 2

Using these two quantities, we can calculate the TF-IDF score of the term t for the
document:

TF-IDF = 0.2 * 2 = 0.4

TF-IDF assigns a score to every term in each sentence, so the reviews can be represented
as a matrix of term scores.
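
The numerical example can be reproduced in a few lines. This is only a sketch of the hand calculation; the function names are hypothetical, and a base-10 logarithm is assumed because it is the base that yields the IDF value of 2 in the example.

```python
import math


def term_frequency(term_count, total_terms):
    """TF: share of the document's terms that are the given term."""
    return term_count / total_terms


def inverse_document_frequency(num_documents, num_documents_with_term):
    """IDF with a base-10 logarithm, matching the worked example."""
    return math.log10(num_documents / num_documents_with_term)


tf = term_frequency(20, 100)                   # 0.2
idf = inverse_document_frequency(10_000, 100)  # 2.0
tf_idf = tf * idf                              # approximately 0.4
```

Library implementations can differ from this hand calculation. For example, scikit-learn's `TfidfVectorizer` by default uses a smoothed, natural-logarithm IDF and normalizes each document vector, so its scores will not match these numbers exactly.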

4.2 MODEL TRAINING:

Implemented Algorithms:

1. Naïve Bayes

The Naïve Bayes algorithm is a supervised learning algorithm that is based on Bayes'
theorem and is used for solving classification problems. It is mainly used in text
classification, which involves high-dimensional training datasets. The Naïve Bayes
classifier is one of the simplest and most effective classification algorithms, and it helps
in building fast machine learning models that can make quick predictions. It is a
probabilistic classifier, which means it predicts on the basis of the probability of an object.

Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment
analysis, and classifying articles. The name Naïve Bayes is made up of two
words, Naïve and Bayes, which can be described as follows:

Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the
basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an
apple. Each feature individually contributes to identifying it as an apple, without depending
on the others.

Bayes: It is called Bayes because it depends on the principle of Bayes' theorem.

Bayes' Theorem:
Bayes' theorem, also known as Bayes' rule or Bayes' law, is used to determine the
probability of a hypothesis given prior knowledge. It depends on conditional probability.

The formula for Bayes' theorem is given as:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

Where,

P(A|B) is the posterior probability: the probability of hypothesis A given the observed event B.

P(B|A) is the likelihood: the probability of the evidence given that the hypothesis is true.

P(A) is the prior probability: the probability of the hypothesis before observing the evidence.

P(B) is the marginal probability: the probability of the evidence.

Comparing Bayes' theorem with our dataset:

Given a dataset with features X = (x1, x2, x3, …, xn) and a target variable y, where xi
represents the value of the i-th feature and y represents the class label, the Naive Bayes
algorithm calculates the probability of each class given the features using Bayes' theorem:

$$P(y \mid x_1, x_2, \dots, x_n) = \frac{P(y) \times P(x_1 \mid y) \times P(x_2 \mid y) \times \dots \times P(x_n \mid y)}{P(x_1) \times P(x_2) \times \dots \times P(x_n)}$$

During training, the algorithm calculates the prior probability of each class P(y) and the
conditional probability of each feature given the class P(xi | y) based on the training data.

1. Prior probability: Calculate the prior probability of each class (positive and negative).

P(Positive) = no. of positive reviews / total no. of reviews

P(Negative) = no. of negative reviews / total no. of reviews

2. Conditional probability:

• Calculate the conditional probability of each word given the class.

For example, for the review "Good Mobile":

• Calculate P(Good | Positive), P(Mobile | Positive), etc.

• These probabilities can be calculated using maximum likelihood estimation or
smoothing techniques like Laplace smoothing.
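
The prior and conditional probabilities described above can be estimated with a short routine. This is a minimal sketch assuming a multinomial word-count model with Laplace (add-one) smoothing; the toy training set, the function name `fit_naive_bayes`, and the parameter `alpha` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy training set: (tokens, class label).
training = [
    (["good", "mobile"], "positive"),
    (["good", "battery"], "positive"),
    (["bad", "mobile"], "negative"),
]


def fit_naive_bayes(examples, alpha=1.0):
    """Estimate class priors and Laplace-smoothed word likelihoods."""
    class_counts = Counter(label for _, label in examples)
    word_counts = {label: Counter() for label in class_counts}
    for tokens, label in examples:
        word_counts[label].update(tokens)
    vocabulary = {word for tokens, _ in examples for word in tokens}

    # P(class) = no. of reviews in the class / total no. of reviews
    priors = {c: n / len(examples) for c, n in class_counts.items()}

    # P(word | class) = (count + alpha) / (total words in class + alpha * |V|)
    likelihoods = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        likelihoods[c] = {
            w: (word_counts[c][w] + alpha) / (total + alpha * len(vocabulary))
            for w in vocabulary
        }
    return priors, likelihoods


priors, likelihoods = fit_naive_bayes(training)
# priors["positive"] is 2/3; likelihoods["positive"]["good"] is (2 + 1) / (4 + 4)
```

Smoothing gives a small non-zero probability to words that never occur in a class (for example "bad" in the positive class), so one unseen word does not force the whole product to zero.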

There are three types of Naive Bayes Model, which are given below:

Gaussian: The Gaussian model assumes that features follow a normal distribution. This
means if predictors take continuous values instead of discrete, then the model assumes that
these values are sampled from the Gaussian distribution.

Multinomial: The Multinomial Naïve Bayes classifier is used when the data follows a
multinomial distribution. It is primarily used for document classification problems, that is,
deciding which category (such as sports, politics, or education) a particular document
belongs to. The classifier uses the frequency of words as the predictors.

Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the
predictor variables are independent Boolean variables, such as whether a particular word is
present in a document or not. This model is also well known for document classification tasks.

Creating Confusion Matrix:

Model Prediction :

New review: "Mobile condition is good"

Tokenization:

Tokenize the review into individual words: ["Mobile", "condition", "is", "good"].

Calculate probabilities:

For each class (positive and negative), calculate the probability of the review belonging to
that class using Bayes' theorem:

P(Positive | review) ∝ P(Positive) × P("Mobile" | Positive) × P("condition" | Positive) × …
P(Negative | review) ∝ P(Negative) × P("Mobile" | Negative) × P("condition" | Negative) × …

Note: The symbol ∝ means "proportional to."

Normalize probabilities: Normalize the probabilities so that they sum to 1.

Prediction: Select the class (positive or negative) with the highest probability as the
predicted sentiment for the review.
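
The prediction steps above (tokenize, multiply probabilities, normalize, pick the largest) can be sketched as follows. All probability values here are hypothetical placeholders, as if they had already been estimated from training data; they are not figures from this project. The sketch sums logarithms instead of multiplying raw probabilities, a common choice to avoid numerical underflow on long reviews.

```python
import math

# Hypothetical probabilities, as if already estimated during training.
priors = {"positive": 0.6, "negative": 0.4}
likelihoods = {
    "positive": {"mobile": 0.20, "condition": 0.10, "is": 0.30, "good": 0.40},
    "negative": {"mobile": 0.25, "condition": 0.15, "is": 0.50, "good": 0.10},
}


def predict(review):
    """Return the most probable class and the normalized class probabilities."""
    tokens = review.lower().split()                # tokenization
    log_scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for token in tokens:
            if token in likelihoods[label]:        # words not in the table are skipped
                score += math.log(likelihoods[label][token])
        log_scores[label] = score

    # Normalize so that the class probabilities sum to 1.
    max_score = max(log_scores.values())
    unnormalized = {c: math.exp(s - max_score) for c, s in log_scores.items()}
    total = sum(unnormalized.values())
    posteriors = {c: v / total for c, v in unnormalized.items()}
    return max(posteriors, key=posteriors.get), posteriors


label, posteriors = predict("Mobile condition is good")
# With these placeholder numbers the positive class has the larger score.
```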

2.Support Vector Machine

Support Vector Machine (SVM) is a powerful supervised learning algorithm used for
classification and regression tasks. In the context of text classification, SVM is often used
for tasks like spam detection. SVM works by finding the optimal hyperplane that separates
data points of different classes with the largest possible margin. Here's an in-depth
explanation of how SVM works:

Linear SVM: Consider a binary classification problem where we have two classes: positive
(+1) and negative (-1). SVM aims to find a hyperplane that best separates the data points of
these two classes in feature space.

Margin: The margin is the distance between the hyperplane and the nearest data point from
either class. SVM aims to maximize this margin because a larger margin typically results in
better generalization to unseen data.

Hyperplane: In a two-dimensional feature space, the separating hyperplane is a line; in three
dimensions, it is a plane; and in a d-dimensional space, it is a (d−1)-dimensional surface.
The equation of a hyperplane in a d-dimensional space is given by:

$$w^T x + b = 0$$

Where,

w = weight vector

x = input feature vector

b = bias term

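Objective function: The objective of SVM is to find the hyperplane that maximizes the
margin while minimizing the classification error. For linearly separable data, this can be
formulated as the optimization problem below. Minimizing $\lVert w \rVert$ maximizes the margin,
because under these constraints the nearest points lie at distance $1 / \lVert w \rVert$ from the
hyperplane.

$$\text{Minimize } \frac{1}{2} \lVert w \rVert^2$$

Subject to:

$$y_i (w^T x_i + b) \ge 1 \quad \text{for all training examples } (x_i, y_i)$$

Where,

$(x_i, y_i)$ are the training examples.

$y_i$ is the class label (+1 or -1).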
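
The constraints and the margin can be checked numerically. The sketch below assumes a hypothetical two-dimensional training set and two candidate hyperplanes chosen by hand; it verifies the constraint y_i (w^T x_i + b) >= 1 and shows that the candidate with the smaller norm has the larger margin.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def satisfies_constraints(w, b, samples):
    """True if y_i * (w^T x_i + b) >= 1 holds for every training example."""
    return all(y * (dot(w, x) + b) >= 1 for x, y in samples)


def margin(w):
    """Distance from the hyperplane to the nearest point: 1 / ||w||."""
    return 1 / math.sqrt(dot(w, w))


# Hypothetical 2-D training set: (feature vector, label).
samples = [((2.0, 2.0), +1), ((3.0, 3.0), +1), ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]

# Two hand-picked candidates that both separate the data.
wide = ((0.5, 0.5), -1.0)
narrow = ((1.0, 1.0), -2.0)

for w, b in (wide, narrow):
    print(w, b, satisfies_constraints(w, b, samples), round(margin(w), 3))
```

Both candidates satisfy the constraints, but the first has the smaller value of ½‖w‖² and therefore the wider margin, which is why the objective prefers it.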

Let's consider a simple example of spam review detection using SVM. Suppose we have a
dataset of reviews labeled as spam or non-spam. Each review is represented as a feature
vector x (e.g., a TF-IDF vector) and belongs to a class y (spam or non-spam).
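
To make the feature vector concrete, the sketch below builds TF-IDF vectors in plain Python. It assumes one common textbook weighting (term frequency divided by document length, multiplied by the natural log of N / document frequency); libraries such as scikit-learn use smoothed and normalized variants, so their numbers will differ. The three reviews are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with tf = count / length and idf = ln(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(token for doc in tokenized for token in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([
            (counts[term] / len(doc)) * math.log(n_docs / df[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


# Hypothetical reviews, for illustration only.
reviews = ["good mobile", "buy now free offer", "good offer"]
vocabulary, vectors = tfidf_vectors(reviews)
print(vocabulary)
print([round(value, 3) for value in vectors[0]])
```

Terms that appear in fewer reviews (such as "mobile") receive a higher weight than terms shared by several reviews (such as "good").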

Training:

• We feed the labeled training data (feature vectors and corresponding labels) into the
SVM algorithm.

• SVM learns the optimal hyperplane that separates spam reviews from non-spam
reviews with the largest margin.
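
The training step can be illustrated with a small linear SVM trained from scratch. Note the substitution: this sketch uses sub-gradient descent on the regularized hinge loss (the soft-margin relaxation), not the quadratic-programming solver that SVM libraries typically use, and the data, learning rate, and regularization strength are hypothetical.

```python
def train_linear_svm(samples, epochs=500, learning_rate=0.01, reg=0.01):
    """Fit (w, b) by sub-gradient descent on the regularized hinge loss."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score < 1:
                # Point is misclassified or inside the margin: move toward it.
                w = [wi - learning_rate * (reg * wi - y * xi) for wi, xi in zip(w, x)]
                b += learning_rate * y
            else:
                # Point is safely classified: apply only the regularization shrink.
                w = [wi - learning_rate * reg * wi for wi in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Hypothetical 2-D feature vectors: +1 = spam, -1 = non-spam.
samples = [((2.0, 2.0), +1), ((3.0, 3.0), +1), ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]

w, b = train_linear_svm(samples)
print([round(wi, 2) for wi in w], round(b, 2))
print([predict(w, b, x) for x, _ in samples])
```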

Prediction:

• Given a new review, we represent it as a feature vector $x_{\text{new}}$.

• We use the learned SVM model to predict the class label for the new review.

• If the value of $w^T x_{\text{new}} + b$ is positive, the review is classified as spam; otherwise, it is
classified as non-spam.
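
The prediction rule amounts to checking the sign of the decision value. A minimal sketch, assuming hypothetical learned parameters `w` and `b` and hypothetical three-dimensional feature vectors:

```python
def decision_value(w, b, x):
    """Compute w^T x + b for a feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Spam if the decision value is positive, otherwise non-spam."""
    return "spam" if decision_value(w, b, x) > 0 else "non-spam"


# Hypothetical parameters and feature vectors, for illustration only.
w, b = [1.5, -0.5, 2.0], -1.0
x_new = [0.8, 0.1, 0.3]
x_other = [0.1, 0.9, 0.0]

print(round(decision_value(w, b, x_new), 2), classify(w, b, x_new))      # 0.75 spam
print(round(decision_value(w, b, x_other), 2), classify(w, b, x_other))  # -1.3 non-spam
```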

Mathematical Formulas:

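Hyperplane equation: $$w^T x + b = 0$$

Objective function: Minimize $$\frac{1}{2} \lVert w \rVert^2$$

subject to:

$$y_i (w^T x_i + b) \ge 1, \quad i = 1, \dots, N$$

Lagrangian function: $$L(w, b, \alpha) = \frac{1}{2} \lVert w \rVert^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i (w^T x_i + b) - 1 \right]$$

Dual formulation: Maximize $$\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j x_i^T x_j$$

Subject to:

$$\sum_{i=1}^{N} \alpha_i y_i = 0$$

$$\alpha_i \ge 0 \quad \text{for all } i$$

Kernel function: $K(x_i, x_j)$, which replaces the inner product $x_i^T x_j$ in the dual
formulation when a non-linear decision boundary is needed.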
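
In the dual form, the decision function is written in terms of the multipliers and the kernel: f(x) = Σ α_i y_i K(x_i, x) + b. The sketch below evaluates it for a linear kernel and an RBF kernel. The support vectors, multipliers, and bias are hypothetical values picked by hand so that Σ α_i y_i = 0 holds; they are not output from a trained model.

```python
import math


def linear_kernel(u, v):
    """K(u, v) = u^T v."""
    return sum(a * b for a, b in zip(u, v))


def rbf_kernel(u, v, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def dual_decision(x, support_vectors, alphas, labels, b, kernel):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(
        alpha * y * kernel(sv, x)
        for sv, alpha, y in zip(support_vectors, alphas, labels)
    ) + b


# Hypothetical support vectors with labels and multipliers (sum of alpha_i * y_i is 0).
support_vectors = [(2.0, 2.0), (0.0, 0.0)]
labels = [+1, -1]
alphas = [0.25, 0.25]
b = -1.0

for point in [(2.0, 2.0), (0.0, 0.0), (3.0, 3.0)]:
    print(point, dual_decision(point, support_vectors, alphas, labels, b, linear_kernel))
```

With the linear kernel, these values correspond to w = Σ α_i y_i x_i = (0.5, 0.5), so the two support vectors evaluate to exactly +1 and -1. Passing `rbf_kernel` instead of `linear_kernel` gives a non-linear boundary without changing the rest of the code.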

Confusion Matrix:

Fig:-6 SVM Decision line

4.3 CREATE A UI
4.3.1 Streamlit:

Streamlit is an open-source Python library that allows you to create interactive web
applications directly from Python scripts. It's designed to make it easy and fast for data
scientists and machine learning engineers to build and share data-driven applications without
needing to have web development skills.

Here's how Streamlit works:

Write Python Script : With Streamlit, you write your application logic in Python scripts
using simple and intuitive syntax. You can use familiar Python libraries like NumPy, Pandas,
and Matplotlib for data processing, visualization, and machine learning tasks.

Declarative Programming : Streamlit uses a declarative programming model, which means
that you specify what you want to appear on the web page, and Streamlit takes care of the
underlying HTML, CSS, and JavaScript code to render the user interface.

Real-time Updates : Streamlit detects when you save changes to your Python script and
reruns the application (automatically, if the always-rerun option is enabled). This allows you
to see changes almost immediately, without needing to restart the server.

Widgets and Components : Streamlit provides a wide range of widgets and components
that you can use to build interactive elements in your application, such as sliders,
dropdowns, buttons, and text inputs. These widgets allow users to interact with your data and
control the behavior of your application.

Installation of Streamlit :

Install Streamlit using the pip command shown below, then import it in the Python script:

pip install streamlit

import streamlit as st
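
A minimal Streamlit page for review classification might look like the sketch below. The `classify_review` function is a hypothetical keyword-based stand-in for the trained model (which the project would load instead, for example from a pickle file), and the page title, widget labels, and file name `app.py` are assumptions. The page is built only when Streamlit is installed, so the helper function can also be used on its own.

```python
import importlib.util


def classify_review(text):
    """Hypothetical stand-in for the trained model: flags a few promotional words."""
    promotional = {"free", "offer", "discount", "buy"}
    tokens = set(text.lower().split())
    return "Spam" if tokens & promotional else "Ham"


def main():
    # Imported here so classify_review can be used without Streamlit.
    import streamlit as st

    st.title("Fake Review Detection")
    review = st.text_area("Enter a review")
    if st.button("Predict"):
        if review.strip():
            st.write("Prediction:", classify_review(review))
        else:
            st.warning("Please enter a review first.")


# Build the page only when Streamlit is installed.
if importlib.util.find_spec("streamlit") is not None:
    main()
```

Saved as `app.py`, the page would be started with `streamlit run app.py`.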

4.4 REQUIREMENTS

4.4.1 Hardware Requirements :

H/W Configuration:

• Processor - Intel i3
• Hard Disk - 160 GB
• RAM - 8 GB

S/W Requirements:

• Operating System : Windows 7/8/10
• User interface design : Streamlit
• IDE : PyCharm, Jupyter
• Libraries used : NumPy, pandas, scikit-learn (sklearn), seaborn, matplotlib, nltk, pickle
• Web Browser : Microsoft Edge
• Technology : Python

5. RESULTS AND ANALYSIS

5.1 PIECHART OF DATASET

Fig:-7 Pie Chart


5.2 HISTOGRAM OF HAM REVIEWS

Fig:-8 Histogram of Ham Reviews
5.3 HISTOGRAM OF SPAM REVIEWS

Fig:-8 Histogram of Spam Review

5.4 CONFUSION MATRIX


5.4.1 Naïve Bayes Model :

Fig:-9 Confusion Matrix

5.4.2 SVM Model :

Fig:-9 Confusion Matrix


5.5 NAÏVE BAYES PREDICTION

5.6 SVM PREDICTION

5.7 ACCURACY OF NAÏVE BAYES

5.8 ACCURACY OF SVM

5.9 CLASSIFICATION REPORT OF NB

5.10 CLASSIFICATION REPORT OF SVM

5.11 USER INTERFACE

Fig:-10 API

5.12 PREDICTION SPAM/HAM

Fig:-11 Result

6. CONCLUSION AND FUTURE ENHANCEMENTS

6.1 CONCLUSION:

In conclusion, spam review detection is a crucial task in various domains, including
e-commerce, hospitality, and online platforms, where the authenticity and credibility of
user-generated content are paramount. By using supervised learning algorithms such as
Support Vector Machines (SVM) and Naïve Bayes, together with feature engineering
strategies, effective spam review detection systems can be developed.

The implementation of SVM models allows for the classification of reviews into spam or
non-spam categories based on their textual content, leveraging features such as TF-IDF
representations. These models are trained on labeled datasets, enabling them to learn patterns
and characteristics indicative of spam reviews, such as excessive promotional language,
irrelevant content, or deceptive practices.

Spam review detection systems represent a critical component in maintaining the integrity
and trustworthiness of online platforms and services. By leveraging advanced machine
learning techniques and feature engineering methodologies, these systems can effectively
identify and filter out spam content, thereby enhancing user experience, trust, and credibility
in online communities.

6.2 FUTURE ENHANCEMENTS

The proposed project can be further developed into multilevel classification by using
machine learning and deep learning algorithms such as decision trees, random forests, and
convolutional neural networks.

7. REFERENCES

1. Ai-jun LI, Lei SHI. Product spam reviews detection based on index optimization. Provides
an overview of spam reviews and detection methods.
2. N. Abdelmageed, H. Tork, Hussein. Survey of fake reviews [2020]. Provides an overview of
various techniques and challenges in detecting fake reviews across different platforms.
3. Han Yutan. False comment recognition [2018]. False comment recognition based on CNN.
4. Li Xiao, Ding Shengchun. Research on the identification of spam comment information
[2013].
5. You Guirong, Wu Wei, Qian Yuntao. Feature extraction method of spam review detection
in e-commerce [2014].
