
2018 23rd International Scientific-Professional Conference on Information Technology (IT) Žabljak, Montenegro

Review Spam Detection using Machine Learning

Draško Radovanović, Božo Krstajić, Member, IEEE

D. Radovanović is with the Faculty of Electrical Engineering, University of Montenegro, Džordža Vašingtona bb, 81000 Podgorica, Montenegro (e-mail: draskor@ac.me). B. Krstajić is with the Faculty of Electrical Engineering, University of Montenegro, Džordža Vašingtona bb, 81000 Podgorica, Montenegro (e-mail: bozok@ac.me).

Abstract — Prior to buying a product, people usually inform themselves by reading online reviews. To make more profit, sellers often try to fake user experience. As customers are being deceived this way, recognizing and removing fake reviews is of great importance. This paper analyzes spam detection methods based on machine learning and presents an overview of them and their results.

Keywords — machine learning, review spam detection.

I. INTRODUCTION

Machine learning is a field of computer science that allows computers to learn from data without being explicitly programmed. Supervised learning, a subfield of machine learning, needs labeled data in order to learn. Data is labeled by human experts or by some system whose behavior should be mimicked. During the training process, the algorithm tries to find a relationship between the input (data) and the output (labels). After training, the system can be used on unlabeled data. The algorithms used by the methods in this paper are supervised learning algorithms.
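As a minimal illustration of that workflow (not taken from the paper; the toy reviews and labels are invented), a classifier is fit on labeled reviews and then applied to unlabeled ones:

```python
# Illustrative supervised learning sketch: fit on labeled reviews, predict on new ones.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labeled training data (1 = spam, 0 = genuine).
train_reviews = ["Best product ever, buy now!!!", "Solid camera, battery could be better."]
train_labels = [1, 0]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_reviews, train_labels)                       # learn the input -> label mapping
print(model.predict(["Amazing, amazing, amazing deal!!!"]))  # apply to unlabeled data
```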
As the Internet continues to grow, online reviews are becoming an ever more relevant source of information. Knowing that a product's success depends on customer reviews, sellers often try to deceive buyers by posting fake comments. Sellers can post reviews themselves or pay other individuals to do it for them. This practice of posting fraudulent reviews is known as opinion or review spam. Spammers can be hired to post positive reviews, or to write bad reviews to damage a competitor's business. The Canadian Competition Bureau issued an official warning to citizens in 2014, stating that they should be aware of fraudulent reviews and estimating that a third of the reviews found online are fake [1]. A poll of over 25,000 participants in 2009 found that over 70% of consumers trust online reviews [2]. This shows that spam reviews are a major concern today. To tackle this problem, many methods have been proposed during the last decade. The majority of published papers can be categorized into three groups, based on the aim of the proposed methods: detecting spam reviews, detecting individual spammers, or detecting group spam [3]. Since methods that focus on group spam are not well researched, they are not presented in this paper. A description of the data used in spam detection is given in Section II. Frequently used techniques and their results are presented in Section III. Experimental results are given in Section IV.

II. REVIEW SPAM DESCRIPTION

Review spam can be divided into three groups, as proposed in [4]:
1. Untruthful opinions
2. Reviews on brands only
3. Non-reviews

Untruthful opinions are purposefully fake reviews. Reviews on brands only are not focused on products, but rather on brands or manufacturers. Non-reviews include advertisements or other irrelevant reviews containing no opinions. Although types two and three fail to address specific products, they are not fraudulent. These types of spam are also easy to spot manually, and traditional classification approaches have no problem detecting them. Untruthful reviews have been shown to be a much harder task for machines as well as for human observers. For these reasons, this is the type of spam considered in this paper.

The data used in techniques for review spam detection can be categorized into three types [5]:
1. Review content
2. Meta-data about the review
3. Product information

Review content is the actual text of the review. Linguistic features used in detecting fraudulent behavior can be extracted from the review content. These include POS (part-of-speech) tags and word n-grams, as well as other semantic and syntactic clues. The problem with using review content alone is that it is easy to create a fake review that looks like a genuine one.

Review meta-data includes information such as the user id, the user's ratings, the IP and MAC addresses of host computers, the time it took to write a review, the reviewer's location, etc. Meta-data is useful for detecting abnormal behavior patterns. However, as opposed to review content, it is not available on many websites.

Product information is information about the merchandise being reviewed, such as the number of reviews, average rating, product description, popularity and sales volume. For example, if a certain product is not selling well but has a lot of praising reviews, that can be suspicious.

The data used for training and testing spam detection systems comes from a variety of sources. The lack of standard datasets for this type of problem makes it hard to compare results from different papers.




Some of the commonly used datasets are the ones obtained from Yelp.com [21] or via Amazon Mechanical Turk (AMT) [20]. AMT is a crowdsourcing platform that was used by the researchers in [14] to create a dataset of fake reviews. It has been shown that reviews created via AMT do not accurately represent real-world data. Yelp offers real data from its site; however, the labeling is based on its own spam filters, which makes this dataset unreliable. Labeling reviews manually, as in many other machine learning problems, is not helpful either: it has been shown that the machine learning techniques used today, although not sufficiently good, are better than humans at recognizing fraudulent reviews [6]. Many researchers have artificially created the fake reviews used in their research, which further complicates comparison of their results with other papers.

III. METHODS OVERVIEW

A. Review centric methods

To the best of our knowledge, the first method for review spam detection was proposed in 2007 by Jindal and Liu [8]. This paper was followed by [9] and [4], in which the initial ideas were further investigated. In [4] a method for detecting spam based on review duplication was proposed. The data was taken from amazon.com and included 5.18 million reviews and 2.14 million reviewers. The authors used a shingle method [10] for detecting near-duplicate reviews: they calculated a similarity score and labeled reviews scoring over 90% as duplicates. After that, 36 features describing reviews, reviewers and product information were extracted. They tried Naive Bayes, support vector machines (SVM) and logistic regression. Naive Bayes and SVM yielded poor results. Using logistic regression, they obtained an AUC (Area Under the ROC Curve) of 0.63 using only text features and 0.78 using all features, showing that more than just review content features are needed for a good classification model.
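As a rough illustration of the shingle idea from [10] (a sketch of the general technique, not the exact procedure or parameters used in [4]; the shingle size is an assumption), two reviews can be compared by the Jaccard overlap of their word k-shingles:

```python
# Near-duplicate detection via word shingles and Jaccard similarity (illustrative sketch).
def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

r1 = "Great phone, the battery lasts two days and the camera is sharp."
r2 = "Great phone, battery lasts two days and the camera is very sharp."
score = jaccard(shingles(r1), shingles(r2))
print(score)  # in [4], pairs scoring above the chosen threshold were labeled as duplicates
```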
Researchers observed that many spammers copy existing reviews entirely or change only a few words. Therefore, many researchers in this area have focused on methods for duplicate detection. These methods look for textual or conceptual similarity between reviews. In [11] the authors used conceptual similarity. They used data collected from a digital camera page and extracted its main features (photo quality, design, zoom, size etc.). Then, for each review, they extracted which features were mentioned and in which context. Using those features, they calculated similarities between reviews and compared them. Using labels obtained from two human observers as ground truth, their method achieved only 43.6% accuracy. A method for measuring text similarity based on the Kullback-Leibler divergence was proposed in [12]. The authors used the SVM algorithm for classification and reported results similar to [4].
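The paper does not reproduce the details of [12]; as a hedged illustration of the underlying idea only, a symmetric Kullback-Leibler score between the smoothed unigram distributions of two reviews could be computed along these lines:

```python
# Symmetric KL divergence between word distributions of two reviews
# (sketch of the general idea, not the exact formulation used in [12]).
import math
from collections import Counter

def word_dist(text: str, vocab: set, alpha: float = 1.0) -> dict:
    """Smoothed unigram distribution over a shared vocabulary (add-alpha smoothing)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def symmetric_kl(text_a: str, text_b: str) -> float:
    vocab = set(text_a.lower().split()) | set(text_b.lower().split())
    p, q = word_dist(text_a, vocab), word_dist(text_b, vocab)
    kl_pq = sum(p[w] * math.log(p[w] / q[w]) for w in vocab)
    kl_qp = sum(q[w] * math.log(q[w] / p[w]) for w in vocab)
    return kl_pq + kl_qp  # small values indicate very similar word usage

print(symmetric_kl("great camera great price", "great camera fair price"))
```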
Detecting spam using similarity between reviews can be a useful technique. However, it should be noted that spammers often copy genuine reviews; with these techniques, both the genuine and the fake review would be classified as spam. Besides duplicate detection, there are many other types of techniques for detecting spam from review content. A nice breakdown of the categories is given in [6]:

Bag of words approaches consider words or sequences of words used in reviews as features. Sequences of words are called n-grams (where n denotes the number of words in a sequence). Values of n = 1, 2, 3 are the most common.

Term frequency includes n-grams as well as the number of their occurrences. This additional information can improve the bag of words approach.
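A minimal sketch of both representations, assuming scikit-learn and an invented two-review corpus: CountVectorizer with binary=True gives a bag-of-words (presence) representation, while the default counts give term frequencies.

```python
# Bag-of-words vs. term-frequency features over unigrams and bigrams (illustrative sketch).
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["best product ever best price", "decent product, fair price"]

bow = CountVectorizer(ngram_range=(1, 2), binary=True)  # presence/absence of n-grams
tf = CountVectorizer(ngram_range=(1, 2))                # n-gram occurrence counts

print(bow.fit_transform(reviews).toarray())
print(tf.fit_transform(reviews).toarray())
print(tf.get_feature_names_out())  # the extracted unigram and bigram vocabulary
```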
POS tags are labels given to words based on their context. This process includes tagging a word based on its definition and its relationships with adjacent words. Words are marked as adverbs, verbs, etc. This information is collected and, together with additional features, fed into a machine learning algorithm.
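For illustration, POS tags (and, for example, their frequencies) can be obtained with a standard tagger; the sketch below assumes NLTK and its default English tagger (resource names may differ across NLTK versions).

```python
# POS tagging a review and counting tag frequencies as features (illustrative sketch).
from collections import Counter
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

review = "This camera takes absolutely stunning photos in low light."
tags = nltk.pos_tag(nltk.word_tokenize(review))
print(tags)                             # e.g. [('This', 'DT'), ('camera', 'NN'), ...]
print(Counter(tag for _, tag in tags))  # POS-tag counts usable as classifier features
```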
Stylometric features try to capture a reviewer's writing style. They include the number of punctuation marks used, the length of words and sentences, etc.
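A few such features might be extracted as in the following sketch (the exact feature set is an assumption; the paper only names punctuation counts and word and sentence lengths):

```python
# Simple stylometric features of a review (illustrative sketch).
import string

def stylometric_features(text: str) -> dict:
    words = text.split()
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    return {
        "punctuation_count": sum(ch in string.punctuation for ch in text),
        "avg_word_length": sum(len(w.strip(string.punctuation)) for w in words) / max(len(words), 1),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
    }

print(stylometric_features("Amazing!!! Best purchase ever. Totally worth it."))
```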
Semantic features focus on the meaning of words. They include synonyms and similar phrases. The idea behind these features is that spammers usually replace some words with similar ones, conveying the same message while making it harder to identify duplicate reviews.

LIWC software [13] is also commonly used in feature engineering for spam detection. LIWC analyses text and groups words into over 80 topical, linguistic and psychological categories. Including LIWC output along with other features has been shown to improve results.

Review meta-data analysis includes information such as review length, duration of writing, reviewer id, etc. These features are used in both review centric and reviewer centric methods.

In [14] the researchers developed three methods for detecting spam. They used a content-based approach, achieving almost 90% accuracy on their dataset. For feature extraction they tried POS tags, LIWC output and n-grams, as well as combinations of them. The classification algorithms used were SVM and Naive Bayes; with these features, SVM outperformed Naive Bayes.

Classifier   Features        Accuracy
SVM          POS             73.0%
SVM          LIWC            76.8%
SVM          Unigrams        88.4%
SVM          Trigrams        89.0%
SVM          Bigrams         89.6%
SVM          LIWC+bigrams    89.8%
Human        Observer 1      61.9%
Human        Observer 2      56.9%
Human        Observer 3      60.6%
Table 1: Accuracy of methods tested in [14]

The results in Table 1 show that the combination of bigrams and LIWC yields the best result (although only 0.2% better than using bigrams alone). Human observers performed poorly, achieving accuracies of around 60%. The authors created a publicly available dataset with 400 genuine and 400 fake reviews. The fake reviews in their dataset were obtained via AMT, while the genuine reviews were obtained from TripAdvisor. All reviews in this dataset are positive reviews.
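A hedged sketch of the strongest content-based setup in Table 1 (bigram features with an SVM) is given below; the placeholder corpus, the specific SVM variant and the use of TF-IDF weighting are assumptions, not details reported in [14].

```python
# SVM over bigram features with cross-validated accuracy (illustrative sketch).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder corpus; in practice this would be the 800-review deceptive-opinion dataset [20].
reviews = ["wonderful stay, amazing staff", "room was dirty and noisy",
           "best hotel ever, perfect in every way", "average hotel, decent location"]
labels = [1, 0, 1, 0]  # invented labels: 1 = fake, 0 = genuine

pipeline = make_pipeline(TfidfVectorizer(ngram_range=(2, 2)), LinearSVC())
scores = cross_val_score(pipeline, reviews, labels, cv=2, scoring="accuracy")
print(scores.mean())
```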

In [15] the same authors tackled review spam containing negative reviews. They collected 400 truthful reviews from six websites, and the same number of fake reviews was collected using AMT. N-grams were used as features and SVM as the classification algorithm. The achieved score was 86% accuracy. They also tested how a system trained on negative reviews behaved when given positive reviews, and vice versa. The system trained on negative reviews had 81.4% accuracy when tested on positive reviews, and 75.1% in the opposite case. A system trained on all samples gave 88.4% accuracy on positive reviews and 86% on negative reviews. As in [14], three human observers were asked to detect which reviews were fake. Their results were 65%, 61.9% and 57.5% accuracy, showing that automatic methods outperform humans by a huge margin.

The datasets used in [14] and [15] contain fake reviews written for the purpose of this research, not taken from real websites. This introduces bias and does not represent real-world data, as observed by the researchers in [16]. They tested the methods presented in [14] on real data from Yelp.com as well as on the AMT dataset. On the AMT dataset they achieved 88.8% accuracy, while on the Yelp dataset the algorithm correctly classified only 67.6% of reviews. Furthermore, the researchers in [17] created a method for synthesizing fake reviews from genuine ones. The authors claim that even the best spam detection algorithms had an error rate higher than 30%. The method presented in [14] was tested on these reviews and achieved an accuracy of only 59.5%, as opposed to the 89.8% reported on the AMT dataset. These results show that the data in the AMT dataset is flawed and cannot be used to obtain reliable results.

B. Reviewer centric methods

Spammers often have similar behavior patterns, which can make their detection easier. Spammer behavior has been analyzed and some useful features, described in [18] and [19], are listed below (a code sketch computing several of them follows the descriptions):

Daily number of reviews: A large number of reviews written in one day by a single user is an indication of a spammer. Most spammers (75%) write more than 5 reviews a day, while 90% of non-spammers write 3 or fewer reviews per day, and 50% write one review per day.

Positive review percentage: Positive reviews are defined as reviews with a four or five star rating. Analyzing data from non-spammers, it was shown that the percentage of positive reviews was uniformly distributed among users. On the other hand, about 85% of spammers had 80% or more positive reviews.

Review length: As spammers are paid by the number of spam messages they post, they tend to write shorter reviews to maximize their profit. The average review length of 92% of regular users is over 200 words, while only 20% of spammers write reviews over 135 words.

Reviewer deviation: Considering that spammers usually give product ratings that are either very high or very low, it is expected that their ratings differ from the average ratings. In [18], the authors calculated the absolute rating deviation of a review from the other reviews of the same product, and then the expected rating deviation for each user across all their reviews. Approximately 70% of non-spammer users had a deviation of less than 0.6, while over 80% of spammers had a deviation greater than 2.5.

Early rating deviation: When a product is published, sellers try to promote it from the very start to get attention. Because of that, spammers are most active right after the product is published. By calculating the average rating of a product and using two features, the review rating and a weight indicating how early that rating was given, the researchers in [19] showed that these features can be used to detect spam reviews.

Maximum content similarity: This feature is based on the fact that spammers usually post the same reviews multiple times, making only small changes. Using cosine similarity between reviews of the same author, it was shown that over 70% of spammers achieve a score of 0.3 or higher, while 30% of non-spammers achieve a cosine score over 0.18.
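To make these definitions concrete, the sketch below computes per-reviewer versions of several of the above features from a tabular set of reviews; the column names and the pandas/scikit-learn tooling are assumptions, not part of [18] or [19].

```python
# Per-reviewer behavioral features: max reviews per day (MNR), positive review
# percentage (PR), average review length (RL), rating deviation (RD) and maximum
# content similarity (MCS). Illustrative sketch with an assumed column schema.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def behavioral_features(reviews: pd.DataFrame) -> pd.DataFrame:
    """Expected columns: reviewer_id, product_id, date, rating, text (assumed schema)."""
    df = reviews.copy()
    df["abs_dev"] = (df["rating"] - df.groupby("product_id")["rating"].transform("mean")).abs()

    def per_reviewer(g: pd.DataFrame) -> pd.Series:
        mcs = 0.0
        if len(g) > 1:  # maximum pairwise cosine similarity between the reviewer's texts
            sim = cosine_similarity(TfidfVectorizer().fit_transform(g["text"]))
            np.fill_diagonal(sim, 0.0)
            mcs = float(sim.max())
        return pd.Series({
            "MNR": g.groupby("date").size().max(),        # max reviews in one day
            "PR": (g["rating"] >= 4).mean(),              # share of 4/5-star reviews
            "RL": g["text"].str.split().str.len().mean(), # average review length in words
            "RD": g["abs_dev"].mean(),                    # expected rating deviation
            "MCS": mcs,
        })

    return df.groupby("reviewer_id").apply(per_reviewer)

toy = pd.DataFrame({
    "reviewer_id": ["u1", "u1", "u2"],
    "product_id": ["p1", "p2", "p1"],
    "date": ["2018-01-01", "2018-01-01", "2018-01-03"],
    "rating": [5, 5, 3],
    "text": ["best product ever", "best product ever, really", "it is acceptable"],
})
print(behavioral_features(toy))
```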
Researchers in [18] used behavioral features on the Yelp dataset: the previously described maximum number of reviews (MNR), positive review percentage (PR), review length (RL), reviewer deviation (RD) and maximum content similarity (MCS). They also used n-grams, as proposed in [14]. Classification was done using SVM with 5-fold cross-validation. Using bigrams on hotel reviews they achieved an accuracy of 64.4%. Behavioral features (BF) yielded 83.2%, while the combination of bigrams and BF reached 84.8%. These results show that, on the Yelp dataset, methods using behavioral features achieve much better results than content-based methods. The authors also tested the effect of excluding one feature from the proposed method.

Excluded feature   Accuracy
MNR                83.1%
PR                 80.1%
RL                 79.7%
RD                 84%
MCS                82.9%
Table 2: Accuracy of the proposed method when one feature is excluded

This shows that omitting one behavioral feature does not significantly affect the accuracy.
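The 84.8% figure comes from feeding both feature groups to one classifier; a sketch of how sparse n-gram features and dense behavioral features might be stacked side by side (an assumption about the mechanics, not the exact setup in [18]) is:

```python
# Combining sparse bigram features with dense behavioral features for one classifier
# (illustrative sketch with invented data; the experiment in [18] used the Yelp dataset).
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = ["great stay, great staff", "dirty room, rude staff",
         "great stay, wonderful view", "ok hotel overall"]
labels = [1, 0, 1, 0]  # invented labels: 1 = fake, 0 = genuine
behavior = np.array([[9, 1.0, 4], [1, 0.0, 4], [7, 1.0, 4], [1, 0.5, 3]])  # e.g. MNR, PR, RL

bigrams = TfidfVectorizer(ngram_range=(2, 2)).fit_transform(texts)
features = hstack([bigrams, csr_matrix(behavior)], format="csr")  # side-by-side feature matrix

clf = LinearSVC().fit(features, labels)
print(clf.predict(features[:1]))
```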
In [19] the authors used the behavioral features early rating deviation and reviewer deviation, as well as targeting products and targeting groups. The targeting products feature is based on checking similarities between different reviews by the same reviewer. The targeting groups feature is based on the fact that spammers sometimes write multiple reviews for products of one manufacturer during a short period of time. The dataset was downloaded from the Amazon website and a logistic regression model was used for training. The methods were evaluated using human judgement, which has been shown to be unreliable by many papers. Nevertheless, this paper identified some important characteristics of spammer behavior which can be further investigated.

Behavioral methods, although not as well researched as content-based techniques, have been shown to be a powerful tool in spam detection. Besides the ones described in this paper, there are many other features collected by websites that could be used to improve the accuracy of spam detection systems. Some of them are the IP and MAC addresses of the host computer, its geo-location, click behavior, etc. However, these features are intended for internal use and are rarely available to researchers, so their contribution to spam detection algorithms is as yet unknown.

IV. EXPERIMENTAL RESULTS

In practice, most websites use proprietary algorithms for spam detection. The decision-making process is usually a combination of spam filtering software and human analysis. As software for the detection of untruthful reviews is not publicly available, results obtained using a popular program for detecting bot spam are presented instead.

Akismet [22] is the most popular plugin for spam detection in blog comments. Its reported statistics at the time of writing are over 400 billion removed spam messages and an accuracy of over 99%. Although its algorithm is not publicly available, some information about its decision-making process is known. When a comment is posted, Akismet compares its content to known spam messages in its database; if a match is found, the comment is marked as spam.
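As a rough illustration of that matching idea only (not Akismet's actual algorithm, which is proprietary), a comment could be compared against a corpus of known spam with a simple similarity threshold:

```python
# Matching a new comment against known spam messages (illustrative sketch only;
# Akismet's real decision process is proprietary and far more sophisticated).
from difflib import SequenceMatcher

known_spam = [
    "cheap watches best prices click here",
    "win a free iphone now visit our site",
]

def looks_like_known_spam(comment: str, threshold: float = 0.8) -> bool:
    """Flag a comment whose text closely matches a previously seen spam message."""
    return any(
        SequenceMatcher(None, comment.lower(), spam).ratio() >= threshold
        for spam in known_spam
    )

print(looks_like_known_spam("Cheap watches, best prices, click here!"))    # True
print(looks_like_known_spam("The zoom on this camera is disappointing."))  # False
```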
The setup for this experiment included creating five WordPress websites on a local machine and installing Akismet on each of them. To post a comment, a non-logged-in user is required to enter a name and an email address. Six types of comments were defined based on their content:
1. Comments without alphanumerical characters
2. Random combinations of characters
3. Random combinations of words
4. Common phrases
5. Links
6. Text with links

The conducted tests show that a comment is marked as spam if it contains a random combination of non-alphanumerical characters (dashes, dots, commas…).

Random combinations of characters or words were not marked as spam immediately. However, if the same comment was posted more than five times in a short time interval, it was marked as spam in all subsequent cases. Common phrases and links were posted tens of times consecutively without being marked as spam. Common phrases are defined here as short sentences or combinations of words that are likely to be found on the web.

Comments containing text with links were marked as spam if posted more than five times in a relatively short amount of time. For example, the comments "http://www.example.com" and "Example test" were not marked as spam when posted tens of times. However, the comment "Example test: http://www.example.com" was marked as spam when posted more than five times consecutively.

The experiment included posting these comments multiple times on the same website as well as on different websites. Comments on different websites were posted using the same email address as well as different email addresses. Further, cases where the time between posting comments was five seconds, one minute and ten minutes were tested. In all these cases the same results were obtained. After a comment is marked as spam, slight variations of it will also be blocked by Akismet. If one comment by a user was marked as spam manually, all further comments posted on the same website by that user were marked as spam automatically. However, other users were still able to post comments with that same content. This shows that manual labeling of spam comments is used for detecting spam users, rather than for determining whether a comment's content represents spam.

V. CONCLUSION

In this paper a brief overview of spam detection methods published during the last decade was presented. It was shown that using different datasets yields extremely different results. Moreover, the lack of a proper gold standard dataset was recognized as a major problem in spam detection. Although linguistic approaches dominate in the number of research papers, spammer detection techniques have shown promising results. Therefore, future research should focus on combining content-based and reviewer-based methods to achieve the best results.

LITERATURE

[1] http://www.competitionbureau.gc.ca/eic/site/cb-bc.nsf/eng/03782.html
[2] Nielsen blog, July 7, 2009, "Global advertising: Consumers trust real friends and virtual strangers the most", http://blog.nielsen.com/nielsenwire/consumer/global-advertising-consumers-trust-real-friends-andvirtual-strangers-the-most
[3] A. Heydari, Mhd. A. Tavakoli, N. Salim, and Z. Heydari, "Detection of review spam: A survey", Expert Systems with Applications, vol. 42, no. 7, pp. 3634-3642, 2015.
[4] N. Jindal and B. Liu, "Opinion spam and analysis", in Proceedings of the International Conference on Web Search and Web Data Mining, pp. 219-230, ACM, 2008.
[5] B. Liu, Sentiment Analysis and Opinion Mining, Synthesis Lectures on Human Language Technologies, pp. 1-167, 2012.
[6] M. Crawford, T. M. Khoshgoftaar, J. D. Prusa, A. N. Richter, and H. Al Najada, "Survey of review spam detection using machine learning techniques", Journal of Big Data, vol. 2, pp. 1-24, 2015.
[7] A. Mukherjee, V. Venkataraman, B. Liu, and N. S. Glance, "What yelp fake review filter might be doing?", in ICWSM, Boston, 2013.
[8] N. Jindal and B. Liu, in Proceedings of the 16th International Conference on World Wide Web, pp. 1189-1190, 2007.
[9] N. Jindal and B. Liu, "Analyzing and detecting review spam", in ICDM, 2007.
[10] A. Z. Broder, "On the resemblance and containment of documents", in Proceedings of Compression and Complexity of Sequences, IEEE Computer Society, 1997.
[11] Algur et al., "Conceptual level similarity measure based review spam detection", in International Conference on Signal and Image Processing, 2010.
[12] Lai et al., "Toward a language modeling approach for consumer review spam detection", in IEEE 7th International Conference, 2010.
[13] Pennebaker et al., "The development and psychometric properties of LIWC2007", LIWC.Net, Austin, TX, 2007.
[14] Ott et al., "Finding deceptive opinion spam by any stretch of the imagination", in 49th Annual Meeting of the Association for Computational Linguistics, 2011.
[15] M. Ott, C. Cardie, and J. T. Hancock, "Negative deceptive opinion spam", in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 497-501, June 2013.
[16] Mukherjee et al., "Fake review detection: Classification and analysis of real and pseudo reviews", Technical Report UIC-CS-03-2013, 2013.
[17] Sun et al., "Synthetic review spamming and defense", in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2013.
[18] Mukherjee et al., "What yelp fake review filter might be doing?", in Seventh International AAAI Conference on Weblogs and Social Media, 2013.
[19] Lim et al., "Detecting product review spammers using rating behaviors", in Proceedings of the 19th ACM International Conference on Information and Knowledge Management, ACM, 2010.
[20] http://myleott.com/op-spam.html
[21] https://www.yelp.com/dataset
[22] https://akismet.com/
