E-Mail Spam Filtering: A Review of Techniques and Trends: January 2018


See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/320703241

E-Mail Spam Filtering: A Review of Techniques and Trends

Chapter · January 2018


DOI: 10.1007/978-981-10-4765-7_61

2 authors:

Alexy Bhowmick, Tezpur University
Shyamanta M Hazarika, Indian Institute of Technology Guwahati

All content following this page was uploaded by Shyamanta M Hazarika on 02 May 2018.



Machine Learning for E-mail Spam Filtering: Review, Techniques and Trends
Alexy Bhowmick · Shyamanta M. Hazarika
arXiv:1606.01042v1 [cs.LG] 3 Jun 2016


Abstract We present a comprehensive review of the most effective content-based e-mail spam filtering techniques. We focus primarily on Machine Learning-based spam filters and their variants, and report on a broad review ranging from surveying the relevant ideas, efforts, effectiveness, and the current progress. The initial exposition of the background examines the basics of e-mail spam filtering, the evolving nature of spam, spammers playing cat-and-mouse with e-mail service providers (ESPs), and the Machine Learning front in fighting spam. We conclude by measuring the impact of Machine Learning-based filters and explore the promising offshoots of latest developments.

Keywords E-mail · False positive · Image spam · Machine learning · Spam · Spam filtering.

Alexy Bhowmick · Shyamanta M. Hazarika
School of Engineering, Tezpur University
Tezpur, Assam, India.
E-mail: alexyb@tezu.ernet.in

1 Introduction

Electronic-mail (abbreviated as e-mail) is a fast, effective and inexpensive method of exchanging messages over the Internet. Whether it's a personal message from a family member, a company-wide message from the boss, researchers across continents sharing recent findings, or astronauts staying in touch with their family (via e-mail uplinks or IP phones), e-mail is a preferred means of communication. Used worldwide by 2.3 billion users at the time of writing, e-mail usage is projected to increase up to 4.3 billion accounts by year-end 2016 [Radicati, 2016]. But the increasing dependence on e-mail has induced the emergence of many problems caused by ‘illegitimate’ e-mails, i.e. spam. According to the Text Retrieval Conference (TREC), the term ‘spam’ denotes an unsolicited, unwanted e-mail that was sent indiscriminately [Cormack, 2008]. Spam e-mails are unsolicited, un-ratified and usually mass mailed. Spam, being a carrier of malware, causes the proliferation of unsolicited advertisements, fraud schemes, phishing messages, explicit content, promotions of causes, etc. On an organizational front, spam effects include: i) annoyance to individual users, ii) less reliable e-mails, iii) loss of work productivity, iv) misuse of network bandwidth, v) wastage of file server storage space and computational power, vi) spread of viruses, worms, and Trojan horses, and vii) financial losses through phishing, Denial of Service (DoS), directory harvesting attacks, etc. [Siponen and Stucke, 2006].

Over the past couple of decades e-mail spam volume has increased exponentially, and spam is not just an annoyance but a security threat, as it continues to evolve in its potential to do serious damage to individuals, businesses and economies. The fact that e-mail is a very cheap means of reaching millions of potential customers serves as a strong motivation for amateur advertisers and direct marketers [Cranor and Lamacchia, 1998]. For example, one of the favorite spam topics is the ‘penny stock’ spam or the pump and dump schemes that take place over the Internet platform. Fraudsters (spammers) purchase large quantities of ‘penny stocks’, i.e. stocks of small, thinly traded companies, through compromised brokerage accounts and promote them via message boards or a broad e-mail campaign, pointing to the transient increase in share value. Even if a fraction of the recipients are fooled into buying the stocks, the spammers make a huge profit. Unwitting investors seeking higher gains believe the hype and purchase the stocks, creating higher demand and raising the price further. Soon after the spam e-mails are sent, the fraudster sells off his stocks at a premium, leaving the duped investors desperate to sell their own. Stock spam is just an old trick that made a massive comeback in the first half of 2013. The Security Threat Report 2014 [Sophos, 2014] suggests that on some days, 50% of the overall spam volume was ‘pump and dump’ mailings.
Fig. 1 The e-mail architecture.
According to recent reports [IBM, 2014], [Cyberoam, 2014], [Symantec, 2014], spam is being increasingly used to distribute viruses, malware, links to phishing sites, etc. An average of 54 billion spam e-mails was sent worldwide each day [Cyberoam, 2014]. Sizeable chunks of these were pharmacy spam, dating spam, online product purchase, diet products and online casino spam [CISCO, 2014]. Another kind of spam that is rapidly evolving is ‘political spam’; e-mail, in contrast to popular media such as print, radio, or television, provides political contestants an economical medium to get through to broad constituents of the electorate. Political spam is but a campaign tactic that mostly involves marketing for political ends or mudslinging.

Spam is a broad concept that is still not completely understood. In general, spam has many forms: chat rooms are subject to chat spam, blogs are subject to blog spam (splogs) [Kolari et al, 2006], search engines are often misled by web spam (search engine spamming or spamdexing) [Gyongyi and Garcia-molina, 2005], [Shi and Xie, 2013], while social systems are plagued by social spam [Lee et al, 2010a]. This paper focuses on ‘e-mail spam’ and its variants, and not ‘spam’ in general.

Prior attempts to review e-mail spam filtering using Machine Learning have been made, the most notable ones being [Androutsopoulos et al, 2006], [Carpinter and Hunt, 2006], [Blanzieri and Bryl, 2008], [Cormack, 2008], [Guzella and Caminhas, 2009], and Wang et al [2013]. We extend earlier surveys by taking an updated set of works into account. We consider e-mail header analysis and analysis of non-content features, which were not discussed in the fairly recent overviews by [Guzella and Caminhas, 2009] and [Wang et al, 2013], who have performed topic modeling instead. We also present a content analysis of the major spam-filtering surveys over the period (2004-2015). Significant amounts of historical and recent literature, including gray literature (dissertations, press articles, technical and security reports, web publications, etc.) were studied to report recent advances and findings. We believe our survey is of complementary nature and provides an inclusive survey of the state-of-the-art in content-based e-mail spam filtering.

Our work addresses the following:

– First, we perform an extensive evolutionary exploration of the major spam characteristics, trends and spammers' evasion techniques. In doing so, we underline some promising research directions and a few research gaps.
– Second, we discuss feature engineering for textual and image spam e-mails. We investigate alternate spam filtering plans based on e-mail header and non-content features.
– Third, we present a taxonomy of content-based e-mail spam filtering and a qualitative summary of major surveys on spam e-mails over the period (2004-2015).
– Fourth, we report new findings and suggest lines of future investigations into machine learning techniques for emerging spam types.

The rest of the paper is organized as follows: In Sec 2, we characterize spam evolution, trends, spam causes and their countermeasures. In Sec 3, we discuss corpus pre-processing, feature extraction, feature selection and analysis of header and non-content features. In Sec 4, we review spam filtering techniques employed prior to Machine Learning. Section 5 offers details on Machine Learning algorithms applied successfully to textual and multimedia content of spam e-mails. Special attention is given to recent techniques. Section 6 overviews standard evaluation measures and publicly available e-mail spam, image spam and phishing e-mail corpora. Finally, Section 7 outlines future research trends.

1.1 E-mail and Spam Filters

When an e-mail is sent, it enters into the messaging system and is routed from one server to another till it reaches the recipient's mailbox. Figure 1 depicts the e-mail architecture and how e-mail works. E-mail depends on a few primary protocols: SMTP (Simple Mail Transfer Protocol) [Jonathan B. Postel, 1982], POP3 (Post Office Protocol) and IMAP (Internet Message Access Protocol). The transmission details are specified by the SMTP protocol. POP3 and IMAP are the
most widely implemented protocols for the Mail User Agent (MUA) and are basically used to receive messages. A Message Transfer Agent (MTA) receives mails from a sender MUA or some other MTA and then determines the appropriate route for the mail [Katakis et al, 2007]. The recipient's MTA delivers the incoming mail to the incoming mail server, the Mail Delivery Agent (MDA), which is basically a POP/IMAP server. MUAs (e.g. Mozilla Thunderbird, Microsoft Outlook, etc.) are e-mail clients and help the user to read and write e-mails. Spam filters can be deployed at strategic places in both clients and servers. Many Internet Service Providers (ISPs) and organizations deploy spam filters at the e-mail server level, the preferred places to deploy being at the gateways, mail routers, etc. They can also be deployed in clients, where they can be installed at proxies or as plug-ins, as in [Irwin and Friedman, 2008]. Some spam filters (e.g. SpamBayes) can be deployed at both server and client levels.

1.2 Structure of an E-mail

An e-mail comprises two elements: the body and the header. The e-mail body comprises unstructured data such as text, HTML markup, multimedia objects and attachments. The header comprises trace information and structured fields that are part of the message content. The Simple Mail Transfer Protocol (SMTP) [Jonathan B. Postel, 1982] defines the e-mail header section to contain fields such as the subject, sender's name, e-mail ID, sending date, routing information, timestamp, etc. for recipient information and successful delivery. Each attribute (field) in the header has a name and specific meaning:

– Received: Contains transit-related information of e-mail servers, IP addresses, dates, etc.
– From: Sender’s name; e-mail ID. “Name” <e-mail@example.com>
– To: Recipient’s name; e-mail ID. “Name” <e-mail@example.com>
– Return-Path: Encloses an optional address specification to be used if an error is encountered (bounce).
– Message-ID: A single unique message identifier designated by the mail system.
– X-Mailer: The mail software used to create/send the message.
– Subject: String placed by the sender that identifies the theme of the message.
– Content-Type: Format of content (character set, etc.), specified by MIME (Multipurpose Internet Mail Extensions).

Each e-mail message comprises the transit-handling envelope [Crocker, 2009] that is hidden from e-mail users. First the envelope sender address is sent, followed by one or more envelope recipient addresses, and finally the actual message is sent. The e-mail servers actually use the envelope address (not the message header address) to deliver the e-mail to the correct recipient. The final recipient sees only the e-mail header and body. The envelope address is one of the e-mail features that is very often abused by spammers.

2 Characterizing Spam Evolution

A couple of decades ago spam e-mail content was mainly textual. Therefore, spam filters analyzed only the e-mail body and header to distinguish ham (legitimate e-mails) from spam e-mails. Today, however, amateur advertisers and opportunists harvest addresses from chat rooms, web pages, newsgroup archives, service provider directories, etc. and send junk e-mail blindly to millions without much cost [Androutsopoulos et al, 2006]. Anti-spam software companies and research groups working on spam filtering for quite some time now have tasted limited success, mostly because spam filtering is an adversarial classification task. In such tasks, a malicious adversary ‘poisons’ the training data with carefully crafted attack techniques in order to mislead a classifier [Jorgensen et al, 2008]. To deliver spam e-mail to a huge number of recipients, spammers often resort to the use of bulk mailing software or e-mail harvesters [Blanzieri and Bryl, 2008].

Spam evolution has been briefly discussed in scientific literature [Carpinter and Hunt, 2006] [Guzella and Caminhas, 2009] [Almeida and Yamakami, 2012]. One reason why spam is difficult to filter is its dynamic nature. The characteristics (e.g. topics, frequent terms, etc.) of spam e-mail vary rapidly over time as spammers always seek to invent new strategies to bypass spam filters. These strategies include word obfuscation, image spam, sending e-mail spam from hijacked computers, etc. A proper understanding of the nature and evolution of spam can help much in the development of proper countermeasures. Some of the evasion techniques and major trends in spam causes and characteristics seen over the years are discussed below:

2.1 Word Obfuscation

Words like ‘sex’, ‘free’, ‘congratulations’ are good indicators of spam and have large (‘spammy’) weights. Initial spam filters based on heuristic filtering could easily detect and filter spam e-mails based on the presence of such obvious words. Figure 2 illustrates a word cloud of common words in spam e-mail [Greenberg, 2010] (the
larger a word appears, the more often it has been found to occur in e-mail spam). Spammers adapted quickly by making sure such obvious words are not encountered verbatim in their messages. To defeat filters, they resorted to simple obfuscation techniques like breaking the word into multiple pieces, as in:

– f-r-e-e (embedding special characters)
– fr<!--xx-->ee (using HTML comments)
– <a href=’m&#97;i&#108;to&#58;%&#54;6re&#101;’>free</a> (with character-entity encoding)
– fr&#101xe (encoded with HTML ASCII codes)

Fig. 2 A word cloud of common words in spam e-mail.

When seen by any web user, all the above samples look the same as “free”, but they thwart simple word/phrase filtering and escape the filter rules. The effectiveness of filter re-training, however, caused spammers to abandon one technique and migrate to newer obfuscation techniques. HTML-based obfuscation techniques are discussed at length in the study by [Pu and Webb, 2006]. Spammers resorted to the use of innocuous words to obfuscate the e-mail message content in order to confuse or circumvent spam filters. In general, there are many ways to obscure the e-mail content: misplaced spaces, purposeful misspellings, embedded special characters (letter substitution), Unicode letter transliteration [Liu and Stamm, 2007], HTML redrawing, etc. Tokenization attacks are a similar spamming technique more associated with the preprocessing stage in spam filtering. In tokenization the spammer works to defeat the feature selection process by splitting and modifying the crucial message features. Examples include introducing spaces, special symbols, or asterisks in words, or using HTML, JavaScript, or CSS layout tricks. A classic example of evading the recognition of the word ‘VIAGRA’ by the spam filter would be ‘V-I-A-G-R-A’.

2.2 Bayesian Poisoning Attacks

A usual criticism of statistical spam filters (e.g. SpamBayes, DSPAM, POPFile) is that they are susceptible to ‘poisoning’ by interjection of random words into the spam messages [Fawcett, 2004], [Graham-Cumming, 2006]. Bayesian poisoning is a kind of statistical attack in which spammers use carefully crafted e-mails to attack the heart of a Bayesian filter and thus degrade its effectiveness. The spammers add random or carefully selected legitimate-seeming words in order to confuse the spam filter and cause it to believe an incoming spam e-mail is not spam (a statistical Type II error). Spammers can get these common English words or ham phrases from sources like Reuters news articles, written and spoken English, and USENET messages. These strong statistical attacks have an incidental consequence too: a statistical Type I error, or simply a higher false positive rate. The reason is that when the user trains the spam filter with the poisoned training data, the spam filter ‘learns’ about such random words as being good evidence of spam [Sanz, 2008]. Paul Graham [Graham, 2002b], however, played down the effectiveness of such poisoning techniques, arguing that to outweigh the statistical significance of even one incriminating word such as “viagra”,
spammers would need many innocent words (e.g. names of one's friends and family, terms used at work, etc.) which are unique for each recipient, and spammers have no way of figuring them out. However, evidence suggests Bayesian poisoning is real and cannot be dismissed [Biggio et al, 2011].

Graham-Cumming [Graham-Cumming, 2004] [Graham-Cumming, 2006] identified two types of possible attacks on Bayesian filters: passive (where in absence of feedback, the spammer can at best make educated guesses) and active (where the spammer discovers an effective wordlist after getting feedback). [Lowd and Meek, 2005] investigated ‘good word attacks’ where a spammer appends words indicative of legitimate e-mail, and found Naive Bayes extremely vulnerable to both scenarios of active and passive attacks. Their results showed frequent filter re-training could mitigate the effectiveness of these attacks. [Wittel and Wu, 2004] explored a simple passive attack of poisoning with random words (a dictionary attack) and found it ineffective against CRM114¹, but effective against SpamBayes². A smarter passive attack with common or ‘hammy’ words (common word or focused attack) saw SpamBayes perform even worse, but CRM114 remained very resistant. [Stern et al, 2004] showed that injecting common words from the English language led to a performance decrease of SpamBayes. Published research indicates that Bayesian poisoning is real, and the number of published attack methods indicates that it cannot be dismissed and that further investigation on poisoning of statistical spam filters is a worthwhile task of research.

¹ CRM114 - the Controllable Regex Mutilator, an open-source spam filtering device
² SpamBayes - a popular open-source spam filtering tool, with 700,000 downloads, is based on techniques laid out by Paul Graham.

2.3 Backscatter Spam

When an e-mail is sent, the sender is normally informed if the e-mail could not be delivered or the delivery was delayed for some reason. E-mail servers normally send a bounce message notifying the sender of delivery problems. Such a message is termed a Delivery Status Notification (DSN). Mostly, DSNs are welcome to the sender and they are generally sent to the envelope sender address. Backscatter occurs when DSNs are sent to senders whose addresses are forged in the message envelope by spammers. In other words, backscatter consists of delivery notifications from another server, rejecting an e-mail made to come across as having been mailed from an account [Cormack and Lynam, 2007]. These mails are then delivered unsolicited in bulk quantities to a lot of recipients. Hence, backscatter qualifies as unsolicited bulk e-mail and is spam. Misdirected bounces from mail servers, misdirected "please confirm your subscription" requests from mailing lists, "out of office" vacation autoreplies and auto-responders, challenge requests from Challenge/Response systems, etc. are the major varieties of backscatter. Backscatter, also called ‘collateral spam’, is a direct consequence of spam.

[Cormack and Lynam, 2007] experimented with six open-source filters and a test set of 49,086 messages, with backscatter representing a mere 1% of the total spam in the test set. It was found that content-based spam filters could filter 98% of the spam, but backscatter was found to be the most difficult to classify, with nearly all the backscatter messages being misclassified. Backscatter is a problem that is hard to deal with, and though spammers may be blamed for it, it simply exists because our mail servers are configured to bounce messages back to fake addresses rather than just reject such spam immediately [McMillan, 2008]. Servers that generate e-mail backscatter can end up on various DNS-based Blacklists (DNSBLs). Improperly configured e-mail servers give rise to ‘open relays’ which contribute to the problem of backscatter. Open relay servers can also get listed in various DNSBLs.

2.4 Image Spam

Text-based spam filters are designed only to analyze different components of an e-mail (sender’s address, header, body, attachments) and detect specific spam characteristics. A new type of spam called image-based spam or image spam is rapidly spreading. It involves textual spam content embedded into images that are attached to e-mails. OCR (Optical Character Recognition)-based modules are effective to a limited extent against image spam [Biggio et al, 2006] [Fumera, 2006]. But often the textual content is obfuscated by spammers to evade OCR tools. Till 2010, the upsurge of spam e-mails meant that roughly up to 85% of all e-mail spam was image spam [Wu and Tsai, 2008].

SpamAssassin, a widely used commercial and open-source spam filter, provides several OCR plug-ins (e.g. OCR Plugin³, Fuzzy OCR Plugin⁴, and Bayes OCR Plugin⁵) that can be used to detect image spam.

³ A SpamAssassin OCR plug-in is maintained at: http://wiki.apache.org/spamassassin/OcrPlugin
⁴ Fuzzy OCR is available but no longer maintained.
⁵ A beta version of BayesOCR plugin is available at http://pralab.diee.unica.it/en/BayesOCR

It has been established from current literature that applying modern classification approaches to the generated
text from image spam is very efficient. Later, signatures were also generated to easily detect and filter already known image spam. The spam filter database by mid-2012 contained more than 40 million relevant spam signatures [IBM, 2012]. In order to avoid signature-based techniques, spammers switched tactics by making arbitrary alterations to a specified template image. They began employing obfuscation, similar to the approach usually applied in web forums, to outsmart Optical Character Recognition (OCR) tools. Lately, Pattern Recognition techniques and Computer Vision are playing a significant part in filtering of multimedia data. However, the solutions achieved so far have shortcomings, and their efficiency is yet to be systematically investigated [Biggio et al, 2006].

Image-based filtering involves extraction of relevant features from the image and classification by state-of-the-art classifiers. Image-based spam detection is an example of classification of multimedia data. A number of researchers have devised approaches based on Pattern Recognition and Computer Vision to address different forms of image spam. In general they can be grouped into two broad categories: a) OCR-based techniques and b) low-level image feature-based techniques. The use of OCR tools to extract text embedded into images, and processing it using modern text categorization techniques, was thoroughly investigated by [Fumera, 2006]. But OCRs have been proven to be computationally expensive and not accurate enough in adversarial situations [Goodman et al, 2007] [Attar et al, 2011]. [Biggio et al, 2006] surveyed and categorized the major techniques which have been suggested as image spam solutions. Spilling of image spam onto social networks like Twitter or Facebook has become widespread. Extraction of features for image-based spam filters is further discussed in Sec 3. A detailed and recent review involving definitions, spam tricks, complete classification of image spam filtering techniques and datasets may be found in [Wu and Tsai, 2008].

2.5 Botnet Spam

At a time when blacklists had almost put the spammers out of business and diminished their profits, some enterprising spammers joined hands with virus and exploit code writers to get access to compromised machines on the Internet known as ‘bots’ or ‘zombies’. The term botnet applies to an army of machines that are compromised and controlled by a single ‘botmaster’. A bot, when subverted (e.g. by a virus/Trojan infection or by specific bot software), can be used to send out spam or malware, harvest password and login information for identity theft and fraud, re-route users to spoofed websites, or even recruit new bots, and so on. Botnets, on the other hand, constitute a major threat to the Internet infrastructure as they have the capability to mount crippling denial of service (DoS) attacks on servers, generate click-fraud [Perera et al, 2013], send out a flood of spam and backscatter [Xie et al, 2008], facilitate phishing and pump-and-dump schemes, form a computational grid to break weak passwords or obfuscate the operator's point of origin, etc. Botnets run on the global level outside the range of national boundaries. According to public tracker Shadowserver⁶, at least one million zombie machines or bots are believed to be active and the number is still growing.

⁶ https://www.shadowserver.org/wiki/

Identifying and blacklisting each and every bot is challenging, both because a botnet attack is momentary and because a single bot transmits only a small volume of spam e-mails to avoid detection. On the other hand, spammers are using large botnets to send spam, thus creating an extremely large number of IP addresses to be blacklisted. Grum, a sneaky, kernel-mode rootkit, was of notable interest to researchers. It was a relatively small botnet with only 600,000 members. Yet it was responsible for almost 25 percent, or 40 billion spam e-mails a day, before it was finally taken down. Identifying botnets is a new challenge for the anti-spam industry, and tracking spammers, bringing them to justice, and pulling down botnet servers becomes an international undertaking. July 18, 2012 saw the takedown of the Grum botnet [Sophos, 2013]. Recently, as a repercussion of the bombing incident during the Boston Marathon on April 15, 2013, botnet spam related to the Boston Marathon bombing was found to have constituted 40 percent of all spam messages transmitted globally on subsequent days [CISCO, 2014].

According to a CISCO report [CISCO, 2007], botnets are the primary security threat on the Internet today. Botnets are hard to detect because of their dynamic nature and their adaptability in evading the common security defenses. Botnets have been studied thoroughly, particularly in the context of spam and phishing [Xie et al, 2008], [John et al, 2009] and [Zhuang et al, 2008]. Botnets are emerging as the most severe threat against cyber-security as they provide a distributed platform for several unlawful activities like distributed denial of service (DDoS) attacks, malware dissemination, phishing, scanning and click fraud. Because botnets attack from multiple fronts, there is no single technology that can provide protection from them.
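
To make the blacklisting burden discussed above more concrete, the following is a minimal sketch of how a single IPv4 address is commonly checked against a DNS-based blacklist: the octets are reversed, the blacklist zone is appended, and the resulting name is looked up in DNS. The function names and the zone `dnsbl.example.org` are hypothetical placeholders and are not taken from any of the surveyed systems; with hundreds of thousands of bots, one such entry would be needed per compromised address.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to query for an IPv4 address in a DNSBL zone.

    The octets are reversed and the blacklist zone is appended, e.g.
    192.0.2.99 in zone dnsbl.example.org -> 99.2.0.192.dnsbl.example.org
    """
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone.strip('.')}"


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the blacklist zone resolves a record for this address.

    This performs a real DNS lookup. A failed resolution (socket.gaierror)
    is treated here as "not listed", which is a simplifying assumption.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False
```

Only `dnsbl_query_name` is deterministic; `is_listed` depends on the network and on the (placeholder) zone chosen, so its result is not shown here.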

2.6 Social Engineering - Phishing

Spammers are increasingly adopting the use of social engineering techniques in spam campaigns. Patient and committed attackers perform extensive research and gain a sophisticated understanding of the needs and motivation of recipients and then contact them with highly believable communications (e.g. e-mails or social networking messages) which may reflect knowledge of the individuals’ work activities, colleagues, friends, and family. Phishing is an illegal attempt that exploits both social engineering and technical deception to acquire sensitive confidential data (e.g. social security number, e-mail address, passwords, etc.) and financial account credentials [Robinson, 2003] [Bergholz et al, 2010]. Phishing involves spam e-mails disguised as legitimate, with a subject or message designed to trick the victims into revealing confidential information. In deceptive phishing, e-mail notifications appearing to come from credit card companies, security agencies, banks, providers, online payment processors or IT administrators are commonly used to exploit the unsuspecting public. The notification encourages the recipient to urgently enter/update their personal data. In most cases, the fraudsters try to frighten a recipient with some "urgent" matter (e.g. "We suspect an unauthorized transaction on your account. To ensure that your account is not compromised, please click the link below and confirm your identity") that requires their immediate attention and the divulging of their personal information. It is often accompanied by a threat to block the account within a limited period if there is no response. Once information such as user-name and password is entered, it becomes a clear case of identity theft followed by worse consequences such as transfer of cash from a victim's account, official documents being obtained, or goods being purchased using stolen credentials. Malicious users are also interested in other types of passwords, such as those for social networks, e-mail accounts and other services [Kaspersky, 2014]. In malware-based phishing, malicious software is spread through e-mails or by exploiting security loopholes and installed on the user’s machine. The malware may then capture user inputs, and confidential information may be sent to the ‘phisher’ [Bergholz et al, 2010]. The phishers’ top targets in 2012 were social networks, financial institutions, non-profit organizations and search engines [IBM, 2014] [Kaspersky, 2014].

Phishing attacks use e-mail as their main carrier in order to lure unmindful victims. Phishing can also occur on a fake web site that is a perfect replica of the official site, such as the log-in page for a banking web site, to harvest e-mail addresses and log-on credentials of their victims. The companies spoofed most often were found to be Barclays, Bank of America, PayPal, eBay, etc. [Ludl et al, 2007]. Phishing attacks and identity theft-based scams are becoming more sophisticated in their exploitation of social engineering techniques. While spamming affects bandwidth, social engineering attacks like phishing directly affect their victims. In recent years ‘pharming’ has evolved to be a major concern to e-commerce and bank sites. In ‘pharming’ the attacker redirects unsuspecting users to fake sites or proxy servers with seeded scripts [Abu-nimeh et al, 2007] [Kaspersky, 2014]. The Internet Security Threat Report [Symantec, 2014] states that the rate of phishing increased from 1 in 414 for 2012 to 1 in 392 in 2013. Many of these phishing attempts involve the creation of fake login pages for popular social networking sites. Besides spoofing login pages of legitimate sites, phishers also began launching baits relevant to current events for flavouring the phishing pages.

Several browser extensions (e.g. SafeCache and SafeHistory for Mozilla) and plug-ins (e.g. SpoofGuard) have been proposed [Chou et al, 2004], [Stepp, 2005], [Nattakant, 2009] and [Sta, 2014]. [Chandrasekaran et al, 2006] have pointed out several weaknesses of existing browser-based solutions and proposed a novel Support Vector Machine (SVM)-based technique for e-mail spam filtering based on the inherent structural properties in phishing e-mails. [Abu-nimeh et al, 2007] evaluated the predictive accuracy of six popular machine learning-based classifiers on phishing data sets. Phishing countermeasures such as secure e-mail authentication, password hashing, etc. involve high administrative overhead, hence content-based filtering can be used to detect phishing attacks and improve existing solutions. While we agree client-side solutions for phishing have been developed over the years even by huge software companies, server-side solutions are the focus of research [Abu-nimeh et al, 2007], [Fette et al, 2007] and [Basnet et al, 2008].

Bergholz et al [2010] have identified a number of highly informative features about phishing attempts and also proposed a server-side statistical phishing filter. The success of phishing is largely determined by the low levels of user-awareness regarding how the fraudsters and spoof sites operate. Increasing user awareness will help users learn to spot the telltale signs of social engineering tricks, which include undue pressure, a false sense of urgency, bogus official letters, too-good-to-be-true offers, quid-pro-quo offers, etc. Meanwhile spam filters remain the first line of defense against phishing. According to the Anti-Phishing Working Group [Anti-Phishing Working Group (APWG), 2014] new brands continue to be targeted by phishers and to battle these
phishing attacks, presently the world needs more phishing databases.

3 Corpus Preprocessing

Not all information present in an e-mail is necessary or useful. Eliminating the less
informative and noisy terms lowers the feature space dimensionality and enhances classification
performance in most cases [Guzella and Caminhas, 2009], [Diao et al, 2003] and [Shi et al,
2012]. Corpus preprocessing is a process that involves transforming the mail corpus into a
uniform format that is more comprehensible to the machine learning algorithms [Zhang et al,
2004], [Katakis et al, 2007]. Due to the adversarial nature of spam, spam filters need to
constantly adapt to changing spam tactics, particularly in feature extraction and feature
selection aspects. No matter which learning strategy is chosen for the training and testing of
content-based filters, it is crucial to either handcraft a private corpus or use a corpus that
is publicly available. In any case, e-mails need to undergo preprocessing as a preparation for
feature extraction. Furthermore, because a corpus may have an immense number of features, it is
very important to choose features judiciously so as to prevent the classifiers from
over-fitting [Drucker et al, 1999]. The effectiveness and success of content-based spam filters
depend on feature engineering, i.e. defining and creating those features more likely to make
the classifier perform better. The primary steps involved in extraction of features from an
e-mail are:

– Lexical Analysis (Tokenization): The string of text representing a message is tokenized in
  order to identify the candidate words to be adopted as relevant spam or ham terms. Headers,
  attachments, and HTML tags are stripped, leaving behind just the e-mail body and subject line
  text. IP addresses and domain names can also be considered as tokens.
– Stop-word Removal: Stop-word removal involves removing frequently used non-informative words,
  e.g. ‘a’, ‘an’, ‘the’, and ‘is’. Obscure texts or symbols may also be removed in subsequent
  steps. Stop-word removal makes the selection of candidate terms more efficient and reduces
  the feature space considerably.
– Stemming: Word-stemming is a term used to describe a process of converting words to their
  morphological base forms, mainly eliminating plurals, tenses, gerund forms, prefixes and
  suffixes. Stemming is closely related to lemmatization, which, while reducing a word, considers
  the part of speech and the context of the word. The primary advantages of employing word
  stemming and lemmatization are feature space dimension reduction and improved classifier
  accuracy.
– Representation: Involves the conversion of an e-mail message into a specific or structured
  format as needed by the machine learning algorithm being employed.

[Androutsopoulos et al, 2000a] studied the effect of corpus size, lemmatization, and stop-lists,
while in [Androutsopoulos et al, 2000c] they studied the effect of word stemming and stop-word
removal on the performance of classifiers. Their results show that these steps often do not
contribute much improvement over the filters without them. [Chih-Chin Lai and Tsai, 2004] found
that stemming did not introduce any significant improvement in the filter’s performance, though
it did reduce the feature set size. In contrast, employing stop-word removal (‘stopping’)
produced better performance.

3.1 Extracting Features

The easiest feature extraction method is the bag of words (BOW) model (or vector-space model),
in which words occurring in the e-mail are treated as features. Given a set of terms
$T = \{t_1, t_2, t_3, \ldots, t_n\}$, the bag of words model represents a document $d$ as an
n-dimensional feature vector $x = \{x_1, x_2, x_3, \ldots, x_n\}$, where $x_i$ is a function of
the occurrence of $t_i$ in $d$. It is possible to use all the features for classification.
However, a feature selection mechanism may be applied to select the best N features by some
measure and thus reduce dimensionality. Another simple text representation is the bag of
character n-grams. [Kanaris et al, 2006] investigated character n-grams and words in spam
filtering to demonstrate the advantage of n-grams over word-tokens. Sparse Binary Polynomial
Hashing (SBPH) [Yerazunis, 2003] is another feature generator from e-mails. However, its many
features made it computationally heavy and of limited use. Siefkes et al [2004] proposed an
effective feature combination technique known as the Orthogonal Sparse Bigrams (OSB) to extract
more compact features. Experiments showed that OSB performed slightly better than SBPH with
regard to error rate. Recently, [Zhu and Tan, 2011] proposed a feature extraction approach
based on local concentration (LC) which efficiently extracted position-correlated information
from e-mail messages. For each style of e-mail analysis, a spam filter developer must decide on
a way of performing feature extraction.
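
The preprocessing steps of Section 3 and the bag of words and character n-gram representations of Section 3.1 can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the tiny stop list, the `crude_stem` suffix stripper (a stand-in for a real stemmer such as Porter's, not a method from the cited works), and the example vocabulary.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption; real filters use much longer lists).
STOP_WORDS = {"a", "an", "the", "is", "and", "to", "of"}


def tokenize(text):
    """Lexical analysis: lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Drop frequently used, non-informative words."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Very rough suffix stripping; a hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]


def bag_of_words(tokens, vocabulary):
    """Return x = (x_1, ..., x_n), where x_i counts occurrences of term t_i."""
    counts = Counter(tokens)
    return [counts[t] for t in vocabulary]


def char_ngrams(text, n=3):
    """Bag of character n-grams computed over the raw text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


if __name__ == "__main__":
    vocabulary = ["free", "offer", "meeting", "click"]
    tokens = preprocess("Click the link for a FREE offer, free offers end today")
    print(bag_of_words(tokens, vocabulary))  # [2, 2, 0, 1]
```

Here $x_i$ is a raw term count; a binary indicator or a TF-IDF weight would be equally valid choices for the "function of the occurrence of $t_i$ in $d$".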

3.2 Feature Selection

Feature selection is a key issue and has become the subject of much research. It has a
three-fold objective: i) enhancing the prediction accuracy of the classifiers, ii) building
faster and more economical classifiers, and iii) obtaining a better understanding of the
elementary process involved in generation of data [Guyon, 2003]. Dimensionality reduction and
feature subset selection are two preferred techniques for lowering the feature set dimension.
While feature subset selection involves the extraction of a subset of the original attributes,
dimensionality reduction involves linear combinations of the original feature set [Gansterer
and Ecker, 2008]. [Cormack, 2008] suggests stop-word removal as a trivial example of feature
selection, and stemming as a simple example of dimensionality reduction. Information Gain (IG)
is one of the simplest and most successful techniques for feature selection. As discussed
earlier, natural language processing provides different ways of selecting features, the
simplest being the ‘bag of words’ model coupled with ‘stemming’ and ‘stopping’. [Zhang et al,
2004] investigated the impact of three popular feature selection techniques: Document Frequency
(DF), Information Gain (IG) and the $\chi^2$ (CHI) test. A novel feature selection method named
Comprehensively Measure Feature Selection (CMFS) was presented and evaluated against popular
feature selection methods, namely Information Gain (IG), Chi statistic (CHI), Document
Frequency (DF), Orthogonal Centroid Feature Selection (OCFS) and DIA association factor (DIA),
to demonstrate that the new method notably outperformed them all [Yang et al, 2012]. Several
well-known methods for feature selection are explained and compared with new feature selection
methods in [Yang and Pedersen, 1997], [Yang et al, 2011] and [Gomez et al, 2012]. [Toolan and
Carthy, 2010] address the issue of effective feature selection by exploring the utility of over
40 features (extracted from ham, spam and phishing pages) that have been used in recent
literature.

3.3 E-mail Header Analysis

E-mail headers determine the recipient of a message and record the specific route the message
takes as it passes through each mail server. Message headers are very reliable and powerful
sources containing discriminative features for spam filtering besides the Subject and e-mail
content. In fact, experimental results confirm that the e-mail header provides powerful cues
for machine learning algorithms to efficiently filter out spam e-mails [Chih-Chin Lai and Tsai,
2004], [Zhang et al, 2004], [Sheu, 2007] and [Wang and Chen, 2007]. This fact was unknown in
spam filtering research before and much research focused on the e-mail message body only.
According to [Zhang et al, 2004], a spam filter trained using header features alone can achieve
results better than or comparable to the body solution. Statistical analysis by [Wang and Chen,
2007] showed that 92.5% of 10,024 junk e-mails were filtered out using the header features
message-ID, mail user agent (MUA), sender address, etc. [Hu et al, 2010] and [Al-jarrah et al,
2012] studied several header-based spam classifiers and evaluated their performance in
filtering e-mail spam. [Sheu, 2007] mined association rules out of other basic attributes in
the e-mail header sessions and proposed an efficient decision tree-based spam filtering method.
E-mail header analysis has evolved to be a very promising research area. As a filter technique
it has the capability to provide low false positive rates either by itself or when used with
other anti-spamming techniques.

Fig. 3 The header of a typical e-mail.

3.4 Filters Based on Non-content Features

Much research has been accomplished in e-mail classification proposing general and specific
solutions to the spam problem. However, most of these approaches explored only the
content-based features [Drucker et al, 1999], [Androutsopoulos et al, 2000b], [Androutsopoulos
et al, 2000a] and [Sakkis et al, 2001]. Filtering based solely on e-mail content has been
argued to be a fundamentally flawed idea. Although such content-based methods have been
effective, the perfectly malleable content of an e-mail and spammers' reactivity to filtering
methods give rise to many challenges (mentioned in Sections 2.1 and 2.2). Different features
such as temporal information, message length, MIME content type, proportion of symbols in
e-mail body, presence of attachments, number of URLs in the e-mail, etc., are considered
non-content features, and have led to promising results in differentiating incoming e-mails.
Non-content features may include header features such as ‘originator field’, ‘destination
field’, ‘X-mailer field’, etc., but they are not limited to header features. [Hu et al, 2010],
[Hershkop and Stolfo, 2005], and [Wang et al, 2005] describe exploiting non-content features
for profiling e-mails and developing efficient and scalable non-content based spam-filtering
frameworks. Table 1 illustrates the popular approaches for feature extraction and feature
selection adopted by researchers and their key inferences.
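
Information Gain, which Section 3.2 names as one of the simplest and most successful feature selection techniques, can be sketched as follows. This is a minimal illustration using the standard entropy-based definition, IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)]; the function names and the toy corpus are assumptions, and none of this is taken from a specific cited implementation.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """IG of a binary 'term present / absent' feature.

    docs   -- list of token sets, one per e-mail
    labels -- parallel list of class labels, e.g. 'spam' / 'ham'
    """
    classes = sorted(set(labels))

    def class_counts(subset):
        return [sum(1 for label in subset if label == c) for c in classes]

    with_term = [l for d, l in zip(docs, labels) if term in d]
    without_term = [l for d, l in zip(docs, labels) if term not in d]
    n = len(labels)
    # Expected entropy of the class after observing whether the term occurs.
    conditional = sum(
        len(part) / n * entropy(class_counts(part))
        for part in (with_term, without_term)
        if part
    )
    return entropy(class_counts(labels)) - conditional


def select_top_terms(docs, labels, k):
    """Keep the k terms with the highest information gain (ties broken alphabetically)."""
    vocabulary = set().union(*docs)
    ranked = sorted(vocabulary, key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]


if __name__ == "__main__":
    docs = [{"free", "offer"}, {"free", "click"}, {"meeting", "agenda"}, {"meeting", "offer"}]
    labels = ["spam", "spam", "ham", "ham"]
    print(select_top_terms(docs, labels, 2))  # ['free', 'meeting']
```

In this toy corpus "free" and "meeting" each separate the two classes perfectly (IG of 1 bit), whereas "offer" occurs once in each class and carries no information.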
Table 1 A summary of Feature Extraction and Feature Selection techniques in popular literature.

Authors                     Approaches

[Zhang et al, 2004]         Studied subject line, header, and message body.
                            Employed Information Gain (IG), Document Frequency (DF), and $\chi^2$ test (CHI)
                            for feature selection.
                            Found bag of words model quite effective on spam filtering, and header features
                            as important as message body.

[Kanaris et al, 2006]       Extracted character n-grams of fixed length and variable-length character n-grams.
                            Explored Information Gain (IG) as a feature selection technique.
                            Character n-grams were noted to be richer and more definitive than word-tokens.

[Delany and Bridge, 2006]   Considered features of three types (word, character and structured features) in a
                            feature-based vs feature-free comparison.
                            Employed Information Gain (IG) as a feature selection technique.
                            Noted feature-free methods to be more accurate than the feature-based system;
                            however, feature-free approaches took much longer than the feature-based
                            approach in classifying e-mails.

[Yeh et al, 2005]           Used behavioral patterns of spammers and meta-heuristics as features.
                            Employed Term Frequency-Inverse Document Frequency (TF-IDF) and SpamKANN for
                            feature selection.
                            Tested SVM, Decision Trees and Naive Bayes, obtaining higher prediction accuracy
                            than with keywords.

[Diao et al, 2003]          Experimented on features: Header (H), Textual (T), handcrafted features (HH), etc.
                            Different ways of feature selection for Decision Tree and Naive Bayes models were
                            evaluated.
                            The usefulness and importance of different types of features were discussed in
                            detail in experiments.

[Méndez et al, 2006]        Considered subject, body, header and attachment features.
                            Analyzed strengths and weaknesses of Document Frequency (DF), Information Gain
                            (IG), $\chi^2$ test (CHI) and Mutual Information.
                            Presented a deep analysis of feature selection methods. Found e-mail attachments
                            to be useful when integrated with models.
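
The non-content features listed in Section 3.4 (temporal information, message length, MIME content type, proportion of symbols, presence of attachments and number of URLs) can be extracted with Python's standard `email` package. The following is only a sketch: the function name, the feature dictionary keys and the simple URL regular expression are assumptions chosen for illustration, not part of any cited framework.

```python
import re
from email import message_from_string
from email.utils import parsedate_to_datetime

# Deliberately simple URL pattern (an assumption; real filters use stricter parsing).
URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)


def non_content_features(raw_message):
    """Compute a few non-content features from a raw RFC 822 style message string."""
    msg = message_from_string(raw_message)

    body_parts = []
    has_attachment = False
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        if part.get_content_disposition() == "attachment":
            has_attachment = True
        elif part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            body_parts.append(payload.decode(charset, errors="replace"))
    body = "\n".join(body_parts)

    # Temporal information from the Date header, if present and parseable.
    sent = None
    date_header = msg.get("Date")
    if date_header:
        try:
            sent = parsedate_to_datetime(date_header)
        except (TypeError, ValueError):
            sent = None

    symbols = sum(1 for ch in body if not ch.isalnum() and not ch.isspace())
    return {
        "message_length": len(body),
        "content_type": msg.get_content_type(),
        "symbol_ratio": symbols / len(body) if body else 0.0,
        "has_attachment": has_attachment,
        "url_count": len(URL_PATTERN.findall(body)),
        "hour": sent.hour if sent else None,
        "weekday": sent.weekday() if sent else None,  # Monday == 0
    }
```

A feature dictionary of this kind could be appended to a content-based feature vector, or used on its own by a non-content classifier.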

3.4.1 Analyzing Temporal Features

As a novel solution to the spam problem, [Kiritchenko et al, 2004] added temporal features of
an e-mail to the conventional content-based approaches to create a richer information space to
work with. A simple example of a temporal feature, obtainable from the message header
timestamps, is the day of the week or the time of the day the e-mail was received. They
represented temporal information in the form of temporal patterns, presented an algorithm for
mining temporal patterns in an e-mail sequence and described approaches to integrate temporal
patterns into content-based e-mail classification. [Hao et al, 2009] explored various
spatio-temporal features of e-mail senders and investigated ways to deduce the reputation of an
e-mail sender based only on such features. To improve the state of affairs they presented
SNARE, a sender reputation system with robust classification accuracy. Investigations reveal
that the use of temporal features to improve spam filter accuracy is perhaps one of the most
uncharted territories in spam filtering research.

3.4.2 SMTP Path Analysis

SMTP path analysis operates by learning about the ‘spamminess’ or goodness of IP addresses by
examining the history of e-mail delivered through that IP address. SMTP traffic analysis, when
used in combination with traditional filters, does improve the accuracy of the filters. [Leiba
et al, 2005] established that examining IP addresses was useful and presented a new algorithm
for learning the reputation of e-mail domains and IP addresses by examining the SMTP path used
to transmit e-mails. [Beverly and Sollins, 2008] examined a variety of SMTP flow
characteristics and developed a spam classifier, ‘SpamFlow’, based on the statistical
discriminatory power of these flows.
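
The idea of learning the ‘spamminess’ of an IP address from the history of e-mail delivered through it can be sketched with simple smoothed counts. This is a hypothetical illustration only: it is not the algorithm of [Leiba et al, 2005], and the class name, the add-one style priors and the choice of scoring a path by its worst relay are all assumptions.

```python
from collections import defaultdict


class IPReputation:
    """Track per-IP spam/ham delivery counts and derive a smoothed spam ratio."""

    def __init__(self, prior_spam=1.0, prior_ham=1.0):
        # Priors keep the estimate at 0.5 for an IP with no delivery history.
        self.prior_spam = prior_spam
        self.prior_ham = prior_ham
        self.spam_counts = defaultdict(int)
        self.ham_counts = defaultdict(int)

    def record(self, ip, is_spam):
        """Update the history after a message relayed by `ip` has been labelled."""
        if is_spam:
            self.spam_counts[ip] += 1
        else:
            self.ham_counts[ip] += 1

    def spamminess(self, ip):
        """Smoothed fraction of spam among messages seen from `ip` (0.0 to 1.0)."""
        spam = self.spam_counts.get(ip, 0)
        ham = self.ham_counts.get(ip, 0)
        return (spam + self.prior_spam) / (spam + ham + self.prior_spam + self.prior_ham)

    def path_score(self, relay_ips):
        """Score an SMTP path by its most suspicious relay (a design assumption)."""
        return max(self.spamminess(ip) for ip in relay_ips)


if __name__ == "__main__":
    reputation = IPReputation()
    for _ in range(9):
        reputation.record("203.0.113.7", is_spam=True)
    reputation.record("203.0.113.7", is_spam=False)
    print(round(reputation.path_score(["198.51.100.1", "203.0.113.7"]), 3))  # 0.833
```

The resulting score could be thresholded on its own or fed to a traditional filter as one additional feature.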

3.4.3 Behavior Analysis

The behavioral pattern of an e-mail is ‘what the sender does in composing or distributing
e-mails’. Legitimate e-mails have mostly normal and meaningful behavioral patterns, while spam
e-mails have abnormal or even conflicting behavior patterns. [Yeh et al, 2005] considered
behavior patterns such as data spoofing, time anomaly, relay anomaly, etc., described them by
meta-heuristics and employed them as features for the classification task. To recognize spam
and viruses as irregular behaviour in the e-mail, [Hershkop, 2006] proposed some behavior
models, among them recipient frequency, group communication, user’s past activity histogram,
etc. [Ramachandran and Feamster, 2006] studied the spammer’s behaviour at the network level and
found that most spam was received from a small number of regions of IP address space. They
suggested that filtering based on network-level characteristics would be much more effective to
combat spam, as network-level properties are less malleable than e-mail content. [Li et al,
2007] performed an experimental study of the community behavior of spammers and came up with
various clustering structures among their population. Based on those structures they proposed
some group-based anti-spam strategies exploiting group membership of perceived spam sources.
Further work on investigating clustering structures of spammers based on features such as
content length, time of arrival, frequency of e-mail, etc. was carried out by [Hao et al, 2009].

3.4.4 Analyzing Users' Social Network

Social networks are very helpful for determining the trustworthiness of outsiders and hence
recent spam filtering approaches have started to exploit social network interactions to
distinguish between spam and ham. For their social network based classification scheme, [Boykin
and Roychowdhury, 2005] analyzed the e-mail header fields to construct a social network graph
of the user, and then classified e-mail messages based on the ‘clustering coefficient’ of the
graph subcomponent. The clustering coefficient is very low for spammers, while it is high for a
network of friends. The algorithm was found to be immune to false positives and could correctly
classify 50% of all e-mails. [Chirita et al, 2009] and [Golbeck and Hendler, 2004] further
developed the idea of creating a social network graph for inferring reputation ratings of
individuals or e-mail addresses.

Table 2 summarizes and categorizes popular machine learning attempts by authors according to
perspective (Algorithm, Architecture, Methods, and Trends). Articles classified under
‘Algorithm’ reflect research that essentially focused on classification algorithms and their
implementations and evaluations. Articles classified under ‘Architecture’ concentrated on work
mainly involved with the development of spam filtering infrastructures. Articles classified
under ‘Methods’ study the existing filtering methods, while ‘Trends’ covers discourses
concentrating on emerging methods and the adaptation of spam filtering methods over time.
Limitations listed in the last column, corresponding to each article, are as acknowledged by
the authors themselves.

4 Methods for Mitigating E-mail Spam

Although there are ‘social’ methods like legal measures and personal measures (e.g. never
respond to spam, never forward chain-letters) to fight spam, they have had only a narrow effect
on spam so far, as seen by the number of spam messages received daily by users. Technical
measures seem to be the most effective in countering spam. Prior to machine learning
techniques, many different technical measures were employed for spam filtering, such as
rule-based spam filtering, white lists, black lists, challenge-response (C/R) systems,
collaborative spam filtering, honey pots, OCR filters, and many others, each with its own
merits and drawbacks. Black-lists, white-lists, challenge-response (C/R) systems, etc. are
origin-based techniques used by reputation-based filters. We briefly discuss some of these
popular approaches:

4.1 Heuristic Filters

Initial spam filters followed the ‘knowledge engineering’ approach and were based on coded
rules or heuristics [Sanz, 2008]. A content-based heuristic filter analyzes the contents of a
message M and classifies it as spam or ham based on the occurrence of ‘spammy’ words like
‘viagra’ or ‘lottery’ in it. Such filters were designed based on the knowledge of regularities
or patterns observed in messages [Guzella and Caminhas, 2009]. The work of Cohen [1996] was one
of the earliest attempts to use learning machines that classify e-mail. Based on the RIPPER
rule-learning algorithm he employed a new method for learning sets of "keyword-spotting rules"
from a corpus to classify personal e-mails into pre-defined categories. He showed that the
RIPPER algorithm can achieve performance comparable to a traditional information retrieval (IR)
method based on TF-IDF weighting.

The drawback of heuristic filters is that maintaining an effective set of rules is a time
consuming affair; moreover, the rules have to be constantly updated to keep up with the newest
trends in spam. Spammers
Table 2 A summary of popular machine learning attempts by authors according to perspective (Algorithm,
Architecture, Methods, and Trends), with their strengths and limitations.

Ref.                             Perspective                       Strengths and Limitations

[Tretyakov, 2004]                Naive Bayes, k-NN, ANN, SVM       Techniques benefits beginners.
                                 Algorithms, Methods               Does not deal with feature selection.

[Androutsopoulos et al, 2006]    Naive Bayes, LogitBoost, SVM      Resulted in LingSpam and PU1.
                                 Algorithm, Methods, Trends        Ignored headers, HTML, attachments.

[Carpinter and Hunt, 2006]       Bayesian filtering                Broad review of implementations.
                                 Methods, Architecture             Focuses primarily on automated filters.

[Blanzieri and Bryl, 2008]       SVM, TF-IDF, Boosting             Explains feature extraction methods.
                                 Algorithms, Methods, Trends       Does not cover neighboring topics.

[Cormack, 2008]                  SVM, Perceptron, Winnow, OSBF     Testing achieves FPR = 0.2 %.
                                 Algorithms, Methods, Trends       User feedback difficult to simulate.

[Guzella and Caminhas, 2009]     Regression, Ensembles             Focuses on textual and image analysis.
                                 Algorithms, Methods               Focuses only on application specific aspects.

[Almeida and Yamakami, 2010]     SVM, Naive Bayes                  Proposed Matthews correlation coefficient (MCC).
                                 Algorithms, Methods               Need to compare with other algorithms & corpuses.

[Almeida and Yamakami, 2012]     MDL principle, SVM                Uses six well-known, large public databases.
                                 Algorithms, Methods               Bogofilter, SpamAssassin filters not considered.

[Caruana and Li, 2012]           Signature, k-NN, ANN, SVM         Focuses on distributed computing paradigms.
                                 Methods, Architecture             Avoids implementation and interoperability issues.

[Wang et al, 2013]               Statistical analysis, n-grams     Investigated topic drift.
                                 Trends                            Limited datasets.

began employing content "obfuscation" (or obscuring), by disguising certain terms that are very
common in spam messages (e.g., by writing "v!@gra" instead of "viagra", or "F*r*e*e" instead of
"Free") in an attempt to prevent the correct identification of these terms by spam filters.
Moreover, writing regular expression-based rules is hard and error prone. In spite of these
limitations, Symantec Brightmail [Sanz, 2008], a rule-based filter solution, was a success from
2004 till the end of the last decade. It could even track down IP addresses that sent mostly
junk mail and performed competitively with SpamBayes, a popular Naive Bayes-based anti-spam
solution.

4.2 Blacklisting

A blacklist of e-mail addresses or IP addresses of the servers from which spam is found to
originate is created and maintained either at the user or server level. If a user receives an
e-mail from any of these addresses, the message is automatically blocked at the SMTP connection
phase. This method requires only a simple lookup in the blacklist every time; hence the
computational cost is low. Black-lists include Real-time Blackhole Lists (RBL) and Domain Name
System Black-lists. Common black-list databases include proxies or open relays, networks or
individual addresses guilty of sending spam. Google blacklists and SpamHaus (footnote 7) are
examples of blacklists.

Blacklist techniques, though effective, suffer from many drawbacks. A legitimate address may be
blacklisted by the filter erroneously or arbitrarily. Innocent users can get victimized and
entire domains (e.g. Hotmail) can get blocked when e-mail IDs or IP addresses are used by
spammers without the owners' consent. As spammers resort to use of new e-mail IDs or IP
addresses to bypass blacklists, frequent updates are required to keep the blacklists
up-to-date. Lately, use of botnets by spammers creates an extremely large number of IP
addresses to be blacklisted. While the time and effort for updating can be overwhelming, any
lag in its timeliness leads to vulnerabilities.

4.3 Whitelisting

Whitelisting is the reverse of blacklisting. An e-mail whitelist is a list of pre-approved or
trusted contacts, domains, or IP addresses that are able to communicate with a mail user. All
e-mails from fresh e-mail addresses are blocked by this method. This restrictive method may
introduce an extremely high false positive rate instead of reducing it. Such a method may be
good for instant messaging environments but is not a good choice for e-mail, as it prohibits
establishing new contacts through e-mail. Moreover, if spammers somehow got their hands on the
whitelist, it would be easy to evade the filter using spoofed addresses, or using well-known
whitelisted mailing lists. This method requires a lot of maintenance but provides only a
moderate filtering rate. It can be employed together with other anti-spam techniques
[Michelakis et al, 2004].

4.4 Greylisting

When an SMTP client connects requesting a session for the first time, the recipient server may
check if the IP address of the sender or its e-mail address is blocked or pre-approved. It may
happen that they are neither in the blacklist nor in the whitelist. In that case the message is
rejected temporarily and the recipient MTA responds with an SMTP temporary error message. The
recipient MTA then records the identity of recent attempts and its databases are updated with
the new client's information; as required by the SMTP RFC [P. Resnick, 2001], the client
retries at a later point of time. The next attempt may be accepted for legitimate senders. This
method assumes that spammers do not waste time in queuing or retrying their messages and that
those who do so will probably end up being blacklisted in public blacklists (DNSBLs) between
the two attempts. While this technique seems very effective, evading it can also be very
simple. The spammers can use zombies to do the work of retrying for the spammer.

4.5 Challenge Response (CR) systems

While white-lists place the burden of determining the authenticity of contacts on the receiver,
Challenge-Response (CR) systems transfer the burden of authentication back to the sender. After
sending an e-mail to the receiver, the sender receives a challenge from the receiving Mail
Transfer Agent (MTA). The challenge may range from a simple question to a CAPTCHA ("Completely
Automated Public Turing test to tell Computers and Humans Apart"). The sender is obliged to
reply correctly in his response; otherwise his message will be deleted or put into the spam
folder. While this method is effective in catching spam from automated systems or botnets, it
introduces an undesirable delay in the delivery process. CR systems are controversial solutions
and are often criticized due to the inconvenience caused by the overhead in communication.
Besides, legitimate e-mails from automated mailing lists may also be blocked since these will
fail the challenge. In addition, CR systems are also believed to be the cause behind the
backscatter e-mail phenomenon. [Isacenkova and Balzarotti, 2011] developed a real world
deployment of a CR based anti-spam system and evaluated its effectiveness and impact on
end-users.

4.6 Collaborative Spam filtering

Spammers typically send spam to a vast number of recipients. It is likely that the same spam
has been received by somebody else. Collaborative spam filtering is a distributed approach to
filtering spam where a whole community works together with shared knowledge about spam [Méndez
et al, 2006], [Sophos, 2013], and [Garriss et al, 2006]. The collaborative approach does not
consider the content of e-mails; rather it requires the accumulation of any identifying
information concerning spam messages, such as the subject, sender, the result of computing a
mathematical function over the e-mail body, etc. Spam messages have digital footprints which
are shared with the community by early receivers. The community users then use these spam
fingerprints for identifying spam e-mails. Vipul's Razor (footnote 8), Pyzor (footnote 9) and
DCC (Distributed Checksum Clearinghouse) (footnote 10) are examples of collaborative spam
filters on the web. Although collaborative techniques show great promise, such schemes suffer
from scalability issues and some underlying implicit assumptions.

7 http://www.spamhaus.org/
8 http://razor.sourceforge.net/
9 https://github.com/SpamExperts/pyzor
10 http://www.rhyolite.com/dcc/
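
The greylisting procedure of Section 4.4 can be sketched as a small decision function. This is a minimal illustration under stated assumptions: the `Greylist` class, the use of a (client IP, sender, recipient) triplet as the recorded "identity", the 300 second minimum retry delay and the string return values are all choices made here for clarity and are not prescribed by the text.

```python
class Greylist:
    """Temporarily reject unknown senders and accept them once they retry later."""

    def __init__(self, min_delay=300, whitelist=(), blacklist=()):
        self.min_delay = min_delay        # seconds a client must wait before retrying
        self.whitelist = set(whitelist)   # pre-approved IPs or addresses
        self.blacklist = set(blacklist)   # blocked IPs or addresses
        self.first_seen = {}              # triplet -> time of first delivery attempt

    def check(self, client_ip, sender, recipient, now):
        """Return 'reject', 'accept' or 'tempfail' for one delivery attempt.

        `now` is the current time in seconds (e.g. from time.time()).
        """
        if client_ip in self.blacklist or sender in self.blacklist:
            return "reject"
        if client_ip in self.whitelist or sender in self.whitelist:
            return "accept"

        triplet = (client_ip, sender, recipient)
        first = self.first_seen.setdefault(triplet, now)
        if now - first >= self.min_delay:
            return "accept"    # a well-behaved client retried after the delay
        return "tempfail"      # corresponds to an SMTP temporary (4xx) error


if __name__ == "__main__":
    greylist = Greylist(min_delay=300)
    attempt = ("192.0.2.10", "alice@example.org", "bob@example.net")
    print(greylist.check(*attempt, now=0))    # tempfail
    print(greylist.check(*attempt, now=600))  # accept
```

A production system would also expire old triplets and usually promote clients that pass the check to an automatic whitelist; both are omitted here to keep the sketch short.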

4.7 Honey pots

A honeypot is a decoy server or system set up solely to collect spam or gather information
about intruders [Andreolini et al, 2005]. It is also used to identify e-mail address harvesters
with the help of specially generated e-mail addresses and to detect e-mail relays. It is a
fingerprint-based technique for content-based spam filtering. Honeypots help security
professionals and researchers learn the techniques used by attackers to compromise computer
systems. Bringer et al [2012] present a survey on the evolution of honeypots as well as
advances and the current trends in honeypot research to cope with recently emerging security
threats.

4.8 Signature Schemes

Most current antivirus products work on the basis of signatures. The hashes of previously
identified spam messages are kept in a database at the Mail Transfer Agent (MTA) level. All
incoming e-mails are checked against these hashes to distinguish between spam and legitimate
e-mails. Because signatures match exact patterns, this scheme can detect known spam with a very
high level of confidence. However, a strong shortcoming is that unknown or newly generated spam
will be able to get past this filter without being detected. Signature databases need to be
updated hourly, daily or weekly. The database can swell as thousands of spam messages are
generated every day. Spammers can introduce a random string into spam mails to generate
different hashes.

This review article examined a number of major earlier surveys on spam filtering over the
period (2004-2015). Perusing the different spam techniques and the methods used by researchers
to combat spam, a taxonomy of spam filtering techniques is presented above (Table 3).

5 Machine Learning Approach to E-mail Spam filtering: The Algorithms

Spam filtering is a binary classification task, in which legitimate (good or ham) e-mails are
treated as negative (-) instances, and spam as positive (+) instances [Song et al, 2009].
Machine Learning is a subfield of computer science that explores the design and development of
computer systems that automatically improve

popular Machine Learning techniques for spam filtering are Naive Bayes [M. Sahami, S. Dumais,
D. Heckerman et al, 1998], [Androutsopoulos et al, 2000a], Support Vector Machines [Woitaszek
and Shaaban, 2003], [Amayri and Bouguila, 2010], Decision Trees [Yeh et al, 2005], [Toolan and
Carthy, 2010], Neural Networks [Wu, 2009], [Soranamageswari and Meena, 2010], etc. The building
of the model or classifier requires a set of pre-classified documents (training set or an
initial corpus). The process of building the model is called training.

Machine learning algorithms have achieved more success than all the previous techniques
(discussed in Section 4) employed in the task of spam filtering [Fdez-Riverola et al, 2007a],
[Almeida et al, 2010]. In fact, the success stories of Gmail [Taylor et al, 2007], [The, 2010]
can be ascribed to their timely transition to and successful use of Machine Learning for
filtering not just incoming spam but other abuses like Denial-of-Service (DoS), virus delivery,
and other imaginative attacks [Taylor et al, 2007]. Today the most successful spam filters are
based upon the statistical foundations of Machine Learning. In part this is because it is
easier to train and build a classifier on e-mails that individual mail users receive than to
build and tune a set of filtering rules. Machine Learning based spam filters also retrain
themselves while in use and minimize manual effort while delivering superior filtering
accuracy. In this section we explore the underlying theory and aim to present a clear picture
of popular Machine Learning algorithms employed in spam filtering for the benefit of readers
unfamiliar with them. Table 3 provides a taxonomy of e-mail spam filtering techniques.

5.1 Naive Bayes (NB)

The Naive Bayes classifier is a technique that has remained popular over the years and is
arguably the most well-known statistical spam classifier. It is called ‘naive’ because it
ignores possible dependencies or correlations among inputs and reduces a multivariate problem
to a group of uni-variate problems [M. Sahami, S. Dumais, D. Heckerman et al, 1998]. It employs
a probabilistic approach to inference. It does not need any complicated iterative parameter
estimation schemes, as in Discriminant analysis. It is easy to construct, easy to interpret,
surprisingly effective and can be readily applied to huge data sets [Wu et al, 2007], making
their performance in a task based on experience. Au- it extremely popular among users. Bayesian methods
tomatic e-mail classification uses statistical approaches typically require prior knowledge of many probabili-
or machine learning techniques and aims at building a ties e.g., according to Grahams [Graham, 2002a] cor-
model or a classifier specifically for the task of filter- pus, the word ‘sex ’ indicates a 97% probability that the
ing spam from a users mail stream. Some of the most containing e-mail is spam. Similarly, words like ‘viagra’

Table 3 A taxonomy of e-mail spam filtering techniques.

Reputation-based            Content-based (Textual)       Content-based (Multimedia)

Reputation based            Textual content               Multimedia content
Origin based                Heuristics                    OCR techniques
Blacklists                  Rule based                    Keyword detection
Whitelists                  Fingerprint based             Text Categorization
Origin Diversity Analysis   Honeypots                     High Level Analysis
Social Networks             Digest based                  Low-level Features
Implicit                    Signature/Checksum schemes    Image Classification
Explicit                    Machine Learning              Near Duplicate Detection
Traffic analysis            Naive Bayes
Mail Volume                 Support Vector Machines
SMTP Flow                   Decision Trees
Protocol based              Clustering
C-R Systems                 Ensembles
Greylisting
and ‘refinance’ will have high spam probability values, while names of friends and siblings will have low spam probability values. These a priori probabilities are combined with the observed data set, a sizeable collection of e-mails that has already been categorized as ‘spam’ and ‘ham’, to determine the final probability that an e-mail message is either spam or legitimate. Even with the flawed assumption of presumed decorrelation, Bayesian classifiers work extremely well and are surprisingly effective [M. Sahami, S. Dumais, D. Heckerman et al, 1998], [Pantel and Lin, 1998], [Graham, 2003] and [Yerazunis, 2004].

The Naive Bayes method has become extremely popular due to the high levels of accuracy that it can potentially provide, and it often serves as a baseline classifier for comparison with other filtering approaches. Bayesian filters are the most widely employed filters for classifying spam nowadays [Guzella and Caminhas, 2009], [Metsis et al, 2006] and can operate either at the network mail server level or in client e-mail programs.

One limitation of standard Bayesian filters is that they ignore the correlation among inputs or events; i.e. such filters do not consider that the words ‘special’ and ‘offers’ are more likely to appear together in spam e-mail than in legitimate e-mail [Carpinter and Hunt, 2006]. Yet text analysis confirms that words are significantly correlated and are not chosen randomly. In spite of this over-simplistic assumption, Bayesian classifiers have been found to work remarkably well [Androutsopoulos et al, 2006], [Almeida et al, 2010]. However, to address this limitation, Yerazunis [2003] and Siefkes et al [2004] introduced sparse binary polynomial hashing (SBPH) and orthogonal sparse bigrams (OSB). SBPH is a generalization of Bayesian filtering that can match mutating phrases as well as individual words or tokens, and uses the Bayesian Chain Rule (BCR) to combine the individual feature conditional probabilities into an overall probability. SBPH had a more expressive feature space and delivered >99.9% accuracy on real-time e-mail without whitelists or blacklists from as little as 500K of pre-categorized text. However, SBPH was computationally expensive; OSB retains the expressivity of SBPH but avoids most of the cost. A filter based on OSB, with the non-probabilistic Winnow algorithm as a replacement for the Bayesian Chain Rule, outperformed SBPH by 0.04% in error rate, and OSB used just 600,000 features, while SBPH used 1,600,000 features to reach its best results. Yerazunis [2004] argued that most Bayesian filters seem to reach a plateau of accuracy at 99.9 percent, so enhancements were necessary. The author set up an SBPH/BCR classifier and compared three different training methods, TEFT (Train Every Thing), TOE (Train Only Errors) and TUNE (Train Until No Errors), and found TOE training to be acceptable in performance and accuracy. Different extensions to Bayesian filtering, such as Token Grab Bag, Token Sequence Sensitive, Sparse Binary Polynomial Hashing with Bayesian Chain Rule (SBPH/BCR), Peaking Sparse Binary Polynomial Hashing and Markovian matching, were also tested. Markovian matching produced the best performance of all the filters.

According to Ludlow [2002], the vast majority of the tens of millions of spam e-mails might be the handiwork of only 150 spammers around the world. Moreover, authors have ‘textual fingerprints’, at least for texts produced by writers who are not consciously changing their

style of writing across texts, as argued by Baayen et al [2002]. Therefore authorship identification techniques can be used to identify the ‘textual fingerprints’ of this small group and eliminate a significant proportion of spam. Brien and Vogel [2003] were the first to apply an authorship identification technique, the ‘Chi by degrees of freedom’ method, to the area of e-mail spam filtering. The authors examined the Naive Bayesian method in relation to this authorship identification technique. They found that the Bayesian method was more effective when characters were used as tokens than when words were used as tokens. The ‘Chi by degrees of freedom’ method, when used with characters as tokens, had a lower error rate than the Bayes method. They concluded that the choice of tokens affected precision and recall. Taking a leaf out of text classification, Song et al [2009] proposed a correlation-based document term weighting method to address the problem of low-FPR classification in the context of Naive Bayes.

[Chih-Chin Lai and Tsai, 2004] conducted systematic experiments on e-mail categorization involving Naive Bayes (NB), Term Frequency - Inverse Document Frequency (TF-IDF), k-Nearest Neighbor (k-NN), and Support Vector Machines (SVMs). NB, TF-IDF and SVM achieved satisfactory results while k-NN had the worst performance of all. It was seen that stemming did not affect performance, whereas employing a stopping procedure yielded better performance. They concluded that combining the different techniques seemed a very promising prospect. [Lai, 2007] made a similar comparative study of three commonly used Machine Learning algorithms: NB, k-NN and SVMs. In the experimental results, NB and SVM were found to perform better than k-NN. [Youn and Mcleod, 2006] and [Yu and Xu, 2008] reported similar experiments with four machine learning algorithms each. [Seewald, 2007] investigated the simple Naive Bayes learner represented by SpamBayes, and two variants of Naive Bayes learning, SA-Train and CRM-114. SA-Train incorporated background knowledge made up of rules, while CRM-114 considered multi-word phrases and their probability estimates. It was seen that all three systems performed equally well, and that neither the addition of background knowledge in SA-Train nor the extended description language of CRM-114 (multi-word phrases) improved Bayesian learning significantly. SpamBayes offered the most stable performance and deteriorated least over time. [Almeida et al, 2010] reported that probabilistic approaches like Bayesian classification suffer from the ‘curse of dimensionality’. They examined how dimensionality reduction influences the accuracy of Naive Bayesian spam filters.

5.2 Support Vector Machines (SVMs)

Support vector machines (SVMs) are ranked as one of the best ‘off-the-shelf’ supervised learning algorithms. SVMs have become one of the most sought-after classifiers in the Machine Learning community because they provide superior generalization performance, require fewer examples for training, and can tackle high-dimensional data with the help of kernels [Rios and Zha, 2004] [Wu et al, 2007]. SVMs work by mapping the feature vectors (training data) into a linear or non-linear feature space through a kernel function. In this feature space an optimal separating hyper-plane (OSH) is found which splits the positive samples and the negative samples with maximum margin. The hyper-plane is then employed as a non-linear decision boundary for use on real-world data.

[Drucker et al, 1999] used SVMs for content-based classification and compared their performance with other classifiers - Ripper, Rocchio, and boosting of C4.5 decision trees. It was found that boosting trees and SVMs attained good performance with regard to speed and accuracy during testing. SVMs with binary features produced the best results, required less training, and their performance did not degrade when too many features were used. Woitaszek and Shaaban [Woitaszek and Shaaban, 2003] utilized an SVM-based filter for Microsoft Outlook to identify commercial e-mail. Classification models for spam and ham messages were built by the SVM using personal and impersonal dictionaries. Both yielded identical results, attaining a best accuracy of 96.69%. [Rios and Zha, 2004] experimented with SVMs and Random Forests (RFs) and compared them against Naive Bayes models. They concluded that SVM and RF classifiers were equivalent, and that the RF classifier had greater robustness at low false positive (FP) rates; both outperformed Naive Bayes models at low FP rates. [Tseng and Chen, 2009] proposed a complete spam detection system, MailNET, which is an incremental SVM model on dynamic e-mail social networks. Although SVMs provide high accuracy for spam filtering, they have generally been associated with high computational cost and some expensive false positive errors; hence, a few solutions were offered, e.g. Online SVMs [Sculley and Wachman, 2007], Ensemble of SVMs [Blanco et al, 2007], etc. A detailed study of various distance-based kernels and spam filtering behaviors employing SVM is found in [Amayri and Bouguila, 2010].

5.3 Clustering Techniques

Clustering is the task of grouping a set of patterns into similar groups. Clustering techniques have been widely

studied and used in a variety of application domains. Spam filtering datasets often have true labels available, and clustering algorithms, being unsupervised learning tools, are not always closely related to true labelings. However, given suitable representations, most clustering algorithms can partition e-mail spam datasets into ham and spam clusters. This was demonstrated by [Whissell and Clarke, 2011] in a novel investigation of e-mail spam clustering. The results were surprisingly significant, as their clustering-based approach bettered previously published state-of-the-art semi-supervised approaches, hence showing that clustering can be a powerful tool for e-mail spam filtering.

Prior to this, Sasaki and Shinnou [Sasaki and Shinnou, 2005] had proposed a spam detection technique making use of text clustering through a vector space model. [Basavaraju and Prabhakar, 2010] presented an effective clustering algorithm integrating K-means and BIRCH algorithm features. The K-means algorithm worked well for small-scale data sets. BIRCH with the K-Nearest Neighbour Classifier (K-NNC) was found to be the ideal combination as it performed better with large data sets. [Debarr and Wechsler, 2009] used a term frequency and inverse document frequency representation for e-mails and employed the Partitioning Around Medoids (PAM) clustering algorithm to cluster a uniform sample of 25% of messages in the training pool. Clustering combined with Random Forests for classification and active learning for refinement produced the best Area Under Curve (AUC) of 95.2%. These works conclude that employing ham/spam clusters is an effective method for spam detection, and because a ham/spam split is a natural clustering for an e-mail spam dataset, clustering techniques should be investigated further as a tool for more robust content-based spam filters.

5.4 Ensemble Classifiers

Ensemble learning is a novel technique where a set of individual classifiers are trained and brought together to enhance the classification accuracy of the overall system on the same problem (spam detection). An ensemble of classifiers is very effective for classification tasks and offers good generalization. Spam filters have to deal with a diversity of spam, so they need to evolve continually in order to detect new types of spam (future spam), and at the same time not allow ‘classical’ spam to evade the filter. Therefore, [Guerra et al, 2010] suggested that combining old and new filters (e.g. using ensemble classifiers) may be an interesting strategy to deal with the diversity of spam. The most popular ensemble classifiers are bagging and boosting.

Bagging (or bootstrap aggregating) is an ensemble meta-learning algorithm that is usually applied to decision tree methods, e.g. the Random Forest algorithm is an ensemble technique for decision trees that is known to achieve very high classification accuracy. [Biggio et al, 2011] employed bagging ensembles to counter poisoning attacks on spam filters. Random forests have also been used in the spam detection models described in [Debarr and Wechsler, 2009] and [Lee et al, 2010b]. Boosting [Biggio et al, 2011] involves algorithms that build a single strong learner from a set of weak learners. AdaBoost is the most common implementation of Boosting. Boosting for filtering of spam messages was first reported by [Carreras and Marquez, 2001]. [Androutsopoulos et al, 2006] compared the four most promising learning algorithms from earlier work - LogitBoost, Naive Bayes, Flexible Bayes and linear SVM. The authors studied the role of attributes characterizing n-gram frequencies and explored the effect of attribute size and training set in a cost-sensitive framework. Using evaluation measures as in [Androutsopoulos et al, 2000b], and the PU1 corpus in experiments, [Carreras and Marquez, 2001] demonstrated the clear benefit of boosting in decision-tree filters. Methods based on boosting outperformed Naive Bayes and Decision Tree algorithms when tested on the PU1 corpus. [Sakkis et al, 2001] experimented with combining a memory-based classifier with a Naive Bayes filter, with another memory-based classifier as president, in a stacking framework. They achieved impressive precision and recall and concluded that stacking consistently raises the performance of the overall filter. He and Thiesson [He and Thiesson, 2007] proposed a new asymmetric boosting method - Boosting with Different Costs - and applied it to spam filtering. [Neumayer, 2006], [Shi et al, 2012], and [Blanco et al, 2007] also discuss the application of ensemble learning to spam filtering.

6 Evaluation Measures and Benchmarks

Ideally spam filters should be evaluated on large, publicly available spam and ham databases. Sometimes Accuracy (Acc), the ratio of messages correctly classified, is used as an integrated measure of performance. If $N_L$ and $N_S$ signify the number of legitimate messages and spam messages to be classified, then we define Accuracy (Acc) and Error (Err) of the spam filter as -

\[ \mathrm{Acc} = \frac{|L \to L| + |S \to S|}{N_L + N_S} \quad \text{and} \quad \mathrm{Err} = 1 - \mathrm{Acc} = \frac{|L \to S| + |S \to L|}{N_L + N_S} \]

Accuracy and Error consider both False Positive |L → S| and False Negative |S → L| events to carry equal cost. However, spam filtering involves asymmetric

error costs. Failing to identify a ham, i.e. misclassifying a ham as spam (a False Positive event), is generally a costlier mistake than missing a spam (a False Negative event). For example, a business letter from the boss or a personal message from a spouse that is quarantined (and delayed) or deleted can lead to serious consequences, while seeing a spam in our inbox may cause just a slight irritation. Since spam is the positive class, a True Positive event |S → S| is when a spam e-mail is correctly classified as spam, and a True Negative event |L → L| is when a ham e-mail is correctly classified as ham. With this in mind, the False Positive Rate (FPR) - the proportion of legitimate e-mails identified as spam - is represented as -

\[ \mathrm{FPR} = \frac{\#\text{ of False Positives}}{\#\text{ of False Positives} + \#\text{ of True Negatives}} \]

Again, failing to identify spam, e.g. e-mails containing viruses, worms, or phishing baits as payload, can incur significant risks to the user. The False Negative Rate (FNR), i.e. the proportion of spam messages that were classified as legitimate, is another suitable measure.

\[ \mathrm{FNR} = \frac{\#\text{ of False Negatives}}{\#\text{ of True Positives} + \#\text{ of False Negatives}} \]

Superior spam classifiers have lower FPR and FNR. The two-dimensional quantity (FNR, FPR) denotes the effectiveness of hard classifiers, while the effectiveness of soft classifiers may be denoted by a set of such pairs defining a curve - an ROC (Receiver Operating Characteristics) curve. ROC analysis is an excellent performance metric in spam filtering. A spam filter whose ROC curve lies strictly above that of another is the better filter in all deployment scenarios [Cormack, 2008].

Two measures borrowed from Information Retrieval, ‘Recall’ and ‘Precision’, are often used for capturing the effectiveness and quality of spam filters respectively [Androutsopoulos et al, 2000a]. If |S → L| signifies the number of spam messages classified as legitimate, and |S → S| signifies the number of spam messages classified as spam, and likewise for |L → L| and |L → S|, then Spam Recall ($R_s$) and Spam Precision ($P_s$) are defined by the equations:

\[ R_s = \frac{|S \to S|}{|S \to S| + |S \to L|} \quad \text{and} \quad P_s = \frac{|S \to S|}{|S \to S| + |L \to S|} \]

Recall ($R_s$) measures the fraction of spam messages successfully blocked by the filter (i.e. its effectiveness), while Precision ($P_s$) measures the fraction of the messages classified as spam by the filter that were indeed spam (i.e. its quality or safety) [Androutsopoulos et al, 2006] [Sakkis et al, 2001]. Comparing spam filters on $R_s$ and $P_s$ is nevertheless tricky, even though each configuration yields both an $R_s$ and a $P_s$ value.

False positives are considerably more expensive (λ times) when compared with false negatives [Androutsopoulos et al, 2000a] [Androutsopoulos et al, 2006]. Here, λ is a parameter that specifies how ‘dangerous’ or ‘costly’ it is to misclassify legitimate e-mail as spam and reflects the extra effort it requires from the user to recover from failures of the filter. For many users false positives are unacceptable. [Androutsopoulos et al, 2006] suggested this cost sensitivity be taken into account by treating each legitimate message as equal to λ messages. The cost-sensitive measures Weighted Accuracy (WAcc), Weighted Error Rate (WErr) and Total Cost Ratio (TCR) [Clark, 2008] are used as shown in the formulas below.

\[ \mathrm{WAcc} = \frac{\lambda |L \to L| + |S \to S|}{\lambda N_L + N_S} \quad \text{and} \quad \mathrm{WErr} = 1 - \mathrm{WAcc} = \frac{\lambda |L \to S| + |S \to L|}{\lambda N_L + N_S} \]

\[ \mathrm{TCR} = \frac{N_S}{\lambda |L \to S| + |S \to L|} \]

The Total Cost Ratio is used to compare the effectiveness of a filter for a given λ with a baseline setting [Guzella and Caminhas, 2009]. It is evidence of the improvement brought about by the filter. This cost-sensitive evaluation uses the λ parameter to adjust the weight of a false positive. Three values of λ are commonly used in the spam literature, λ = 1, 9, 999 [Androutsopoulos et al, 2000b], [Androutsopoulos et al, 2000a], [Androutsopoulos et al, 2006], [Sakkis et al, 2001] and [Clark, 2008]. These values represent the situations in which a false positive costs the same as a false negative, is 9 times costlier than a false negative, or is 999 times costlier. Greater TCR values indicate superior performance. The F-measure or F-score is another measure that combines both the Precision ($P_s$) and Recall ($R_s$) metrics in one equation. It can be interpreted as the weighted harmonic mean of both.

\[ \text{F-measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]

6.1 Publicly Available Benchmark Datasets

Most of the publicly available datasets are static datasets, with very few concept drift datasets. Many authors construct their own image spam or phishing corpus. Table 4 below lists public corpora used in spam filtering experiments, with associated information.

Table 4 Publicly available benchmark datasets on E-mail Spam.

                                Number of Messages/Images
Corpus Name                     Spam      Ham       Spam Rate  Year of Creation  Reference/Used

SpamAssassin                    1897      4150      31%        2002              [Méndez et al, 2006]
Enron-Spam                      13,496    16,545    -          2006              [Koprinska et al, 2007]
LingSpam                        481       2412      17%        2000              [Sakkis et al, 2001]
PU1                             481       618       44%        2000              [Attar et al, 2011]
PU2                             142       579       20%        2003              [Zhang et al, 2004]
PU3                             1826      2313      44%        2003              [Zhang et al, 2004]
PUA                             571       571       50%        2003              [Zhang et al, 2004]
Gen Spam                             41,404         78%        2005              [Cormack and Lynam, 2007]
Spambase                        1813      2788      39%        1999              [Sakkis et al, 2001]
ZH1                             1205      428       74%        2004              [Zhang et al, 2004]
TREC 2005                       52,790    39,399    -          2005              [Androutsopoulos et al, 2000a]
TREC 2006                       24,912    12,910    -          2006              [Androutsopoulos et al, 2000c]
TREC 2007                       50,199    25,220    -          2007              [Debarr and Wechsler, 2009]
Spam Archive                        >220,000        100%       1998              [Almeida and Yamakami, 2012]
Biggio                          8549      0         -          2005              [Biggio et al, 2006]
Princeton Spam Image Benchmark  1071      0         -          -                 [Biggio et al, 2006]
SpamArchive                         >220,000        100%       1998              [Almeida and Yamakami, 2012]
Dredze Image Spam Dataset       3927      2006      -          2007              [Almeida and Yamakami, 2012]
Phishing Corpus                 415       0         -          2005              [Abu-nimeh et al, 2007]
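
The evaluation measures of Section 6 can be computed directly from the four outcome counts. The following Python sketch is only an illustration: the names `Confusion` and `evaluate` are hypothetical, spam is taken as the positive class, and the weighted measures use the denominator λN_L + N_S of [Androutsopoulos et al, 2000a], so that WAcc + WErr = 1. It assumes both classes are non-empty and that at least one spam message is caught.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Outcome counts for a spam filter; spam is the positive class."""
    l_to_l: int  # |L -> L| legitimate kept as legitimate (true negatives)
    l_to_s: int  # |L -> S| legitimate flagged as spam (false positives)
    s_to_s: int  # |S -> S| spam flagged as spam (true positives)
    s_to_l: int  # |S -> L| spam passed as legitimate (false negatives)


def evaluate(c: Confusion, lam: float = 1.0) -> dict:
    """Return the Section 6 measures for false-positive weight `lam`."""
    n_l = c.l_to_l + c.l_to_s  # N_L, number of legitimate messages
    n_s = c.s_to_s + c.s_to_l  # N_S, number of spam messages
    recall = c.s_to_s / n_s
    precision = c.s_to_s / (c.s_to_s + c.l_to_s)
    weighted_errors = lam * c.l_to_s + c.s_to_l
    weighted_total = lam * n_l + n_s
    return {
        "Acc": (c.l_to_l + c.s_to_s) / (n_l + n_s),
        "Err": (c.l_to_s + c.s_to_l) / (n_l + n_s),
        "FPR": c.l_to_s / n_l,
        "FNR": c.s_to_l / n_s,
        "Rs": recall,
        "Ps": precision,
        "F": 2 * precision * recall / (precision + recall),
        "WAcc": (lam * c.l_to_l + c.s_to_s) / weighted_total,
        "WErr": weighted_errors / weighted_total,
        # TCR is unbounded when the filter makes no weighted errors.
        "TCR": n_s / weighted_errors if weighted_errors else float("inf"),
    }
```

As an example, a filter that makes 10 false positives and 20 false negatives on 100 legitimate and 100 spam messages has Acc = 0.85, but with λ = 9 its TCR is 100/110, i.e. below 1, which under the usual reading of TCR means the filter does worse than the no-filter baseline at that cost setting.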

7 Future Challenges and Conclusion

Spam filtering is an ‘arms race’ marked by an increase in the sophistication of spam construction techniques as well as spam filtering techniques [Goodman et al, 2007]. Characterization and measurement studies have been developed in content-based spam filtering [Pu and Webb, 2006], [Blanzieri and Bryl, 2008]. The evolution of the infrastructure used by spammers to disseminate spam over the network is seen in their migration from simple obfuscation techniques, to image spam and to compromised machines. The dynamic nature of spam and the reactivity of spammers make e-mail spam filtering an active research area. E-mail spam filtering will remain a persistent problem, and some of the most interesting challenges for the future of e-mail spam filtering are the following.

7.1 Handling Concept Drift

In the actual world, concepts change over time in unanticipated ways and are therefore hard to predict. Changes in the statistical properties of context can lead to a change in the target variable or concept. Concept drift is distinguished in the literature as ‘sudden’ and ‘gradual’ [Tsymbal et al, 2008]. For example, a student graduating from college might all of a sudden develop financial concerns, whereas, in a biomedical context, pathogen sensitivity may gradually evolve with the passage of time as bacterial pathogens develop immunity to antibiotics that used to be effective earlier. Hidden changes in context affect not just the target concept but also cause an alteration in the underlying data distributions [Delany et al, 2005], making the learning task increasingly complicated and requiring special approaches. Models built on old data become less accurate or inconsistent

making the rebuilding of the model imperative (called virtual concept drift). Spam filtering is a dynamic problem that involves concept drift. While the understanding of an unwanted message may remain the same, the statistical properties of spam e-mail change over time, since it is driven by spammers involved in a never-ending arms race with spam filters. Another reason for concept drift could be the different products or scams driven by spam that tend to become popular. The dynamic nature of spam is one of its most testing aspects. An effective spam filter must be able to track target concept drift and swiftly adapt to it. Research on concept drift confirms lazy learning techniques to be the most effective models against concept drift [Tsymbal, 2004], [Tsymbal et al, 2008]. Most of the earlier evaluations did not try to deal with concept drift, or with real-world datasets that have some concept drift. A few authors have tried to address concept drift in spam filtering using Case-Based Reasoning [Delany et al, 2005], Instance-Based Reasoning [Fdez-Riverola et al, 2007b], Ensemble Learning [Tsymbal et al, 2008], and a Language Model technique [Hayat et al, 2010]. A particular challenge in handling concept drift is distinguishing between true concept drift and noise. Research in concept drift is a very active area in spam filtering.

7.2 Eliminating False Positives

Spam filtering is often viewed as a straight text categorization problem. But e-mail is not just text; it also has structure, and hence in reality it turns out to be a more complicated problem than straightforward classification. One complication arises from the cost-sensitivity associated with the spam filtering problem. The cost of inadvertently restricting a ham message is greater than that of a spam message evading the filter (see Section 6). Such mislabeling of e-mail is completely unacceptable to users as it can lead to loss of important information or even more serious consequences. Moreover, in this case the user has to review the messages sorted into the spam folder, which somewhat defeats the whole purpose of spam filtering [Tretyakov, 2004]. Content-based spam filtering systems, though widely adopted as a successful spam defense strategy, have unfortunately substituted the spam issue with a false positive one. Such systems achieve a high accuracy, but there exists some false positive tradeoff. False positives are more severe and expensive than spam. Although significant attempts have been made, e.g. Reliable e-mail [Garriss et al, 2006], spam filters must nevertheless reduce the incidence of false positives to make e-mail reliable. Reduction of false positives is another domain in e-mail spam analysis where much work needs to be done on leveraging existing algorithms.

7.3 Emerging Spam Threats

One of the biggest spam problems today, even as spam e-mail volumes associated with botnets are receding, is snowshoe spam. Snowshoe spamming is a technique that uses multiple IP addresses, websites and sub-networks to send spam, so as to avoid detection by spam filters. The term ‘snowshoe’ spam describes how some spammers distribute their load across a larger surface to keep from sinking, just as snowshoe wearers do [McAfee, 2012] [Sophos, 2013]. Social networks have also become a hunting ground for spammers. With many users migrating to social networks as a means of communication, spammers are diversifying in order to stay in business. The personal information revealed in social networks is gleaned by spammers to target unsuspecting victims with tailored e-mails.

7.4 Prioritising E-mails

E-mail prioritization is an urgent research area in which not much research has been done. In addition to basic communication, our e-mails are ‘overloaded’ in the sense of being used for a wide variety of other tasks - communication, advertisements, reminders, contact management, task management, and cloud storage. There is a serious need to address the information overload issue by developing systems that can learn personal priorities from data and identify important e-mails for each user. Prioritizing e-mail according to its importance is another desirable characteristic in a spam filter. Prioritizing e-mail or perhaps redirecting urgent messages to handheld devices could be another way of managing e-mails [Koprinska et al, 2007]. Learning to prioritize or rank is a relatively new field in which Machine Learning algorithms are used to learn some ranking function. [Dredze et al, 2009] and [Aberdeen and Slater, 2011] are significant works on ranking algorithms, proposing useful filters that rapidly filter groups of inbox messages and make messages easier to search. However, importance ranking is harder than it seems, as users often disagree on what is important, requiring a high degree of personalization. The result is the growth of one of the most challenging research areas in Machine Learning, i.e. personalized e-mail prioritization [Yang et al, 2010], which relies mostly on the analysis of social networks to model user priorities among incoming e-mail messages.
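
Section 7.1 asks that a filter track concept drift and adapt to it quickly. As a rough sketch of that requirement only, the Python class below keeps a Naive Bayes model over token presence that is trained on the most recent messages and forgets older ones. The sliding-window forgetting scheme, the class name `WindowedNaiveBayes` and its parameters are assumptions chosen for illustration; this is not one of the case-based, instance-based, ensemble or language-model methods cited in Section 7.1.

```python
import math
from collections import Counter, deque


class WindowedNaiveBayes:
    """Naive Bayes over token presence, using only the last `window` messages."""

    def __init__(self, window: int = 1000):
        self.window = window
        self.recent = deque()  # (tokens, label) pairs, oldest first
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = Counter()

    @staticmethod
    def tokenize(text: str) -> set:
        # Whitespace tokenization is a simplification for this sketch.
        return set(text.lower().split())

    def learn(self, text: str, label: str) -> None:
        """Add one labelled message ('spam' or 'ham'); drop the oldest if full."""
        tokens = self.tokenize(text)
        self.recent.append((tokens, label))
        self.token_counts[label].update(tokens)
        self.message_counts[label] += 1
        if len(self.recent) > self.window:
            old_tokens, old_label = self.recent.popleft()
            self.token_counts[old_label].subtract(old_tokens)
            self.message_counts[old_label] -= 1

    def spam_log_odds(self, text: str) -> float:
        """Log-odds of spam versus ham; positive values favour spam."""
        n_spam = self.message_counts["spam"]
        n_ham = self.message_counts["ham"]
        score = math.log((n_spam + 1) / (n_ham + 1))  # smoothed class prior
        for tok in self.tokenize(text):
            # Laplace-smoothed per-class probability that the token is present.
            p_spam = (self.token_counts["spam"][tok] + 1) / (n_spam + 2)
            p_ham = (self.token_counts["ham"][tok] + 1) / (n_ham + 2)
            score += math.log(p_spam / p_ham)
        return score
```

Because old messages leave the window, a token that once indicated ham can come to indicate spam once spammers start using it, without any manual rebuilding of the model. The window size trades stability against reactivity: a small window adapts quickly but is also more easily misled by noise, which is the difficulty of telling true drift from noise noted in Section 7.1.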

8 Conclusion

Future research must address the fact that e-mail spam filtering is a co-evolutionary problem: as the filter attempts to extend its predictive accuracy, spammers attempt to outdo the classifiers. Hence, an effective approach should find a successful mechanism to identify the drift or evolution in spam features. Among all the traditional approaches discussed so far, the single approach that has achieved tremendous success against spam is content-based spam filtering. Fortunately, machine learning-based systems can learn and adapt to new threats, reacting to the countermeasures adopted by spammers.

No single anti-spam solution may be the right answer. A multi-faceted approach that combines legal and technical solutions, and more, is likely to deal a death blow to such spam. Without an effective solution, spam will only continue to decrease the value of an efficient communication medium. As long as spam exists, it will continue to have adverse effects on the preservation of the integrity of e-mails and on the user’s perception of the effectiveness of spam filters. We reviewed the content-based spam filtering techniques based on Machine Learning methods propounded so far, highlighting the main approaches and the advancements they have achieved. A quantitative analysis of the major reviews over the last decade was conducted. Overall, the number and quality of the literature demonstrate that remarkable advancements have been achieved and continue to be achieved. However, some outstanding problems in e-mail spam filtering, as highlighted above, still remain. Until further improvements in spam filtering happen, anti-spam research will remain an active research area.

References

(2010) An Analyst Review of Hotmail Anti-Spam Technology. A White Paper. Tech. rep., The Radicati Group Inc, URL www.radicati.com
(2014) Stanford Anti-Phishing Browser Extensions. URL https://crypto.stanford.edu/antiphishing/
Aberdeen D, Slater A (2011) The Learning Behind Gmail Priority Inbox. In: NIPS 2010 Workshop on Learning on Cores, Clusters and Clouds, pp 3–6
Abu-nimeh S, Nappa D, Wang X, Nair S (2007) A Comparison of Machine Learning Techniques for Phishing Detection. In: eCrime 07: Proceedings of the Anti-phishing Working Groups 2nd Annual eCrime Researchers Summit, New York, USA, pp 60–69
Al-jarrah O, Khater I, Al-duwairi B (2012) Identifying Potentially Useful Email Header Features for Email Spam Filtering. In: The Sixth International Conference on Digital Society, c, pp 140–145
Almeida TA, Yamakami A (2010) Content-Based Spam Filtering. In: The 2010 International Joint Conference on Neural Networks (IJCNN), Barcelona, pp 1–7
Almeida TA, Yamakami A (2012) Advances in Spam Filtering Techniques. In: Computational Intelligence for Privacy and Security, Springer Berlin Heidelberg, pp 199–214
Almeida Ta, Almeida J, Yamakami A (2010) Spam Filtering: How the Dimensionality Reduction Affects the Accuracy of Naive Bayes Classifiers. Journal of Internet Services and Applications 1(3):183–200, DOI 10.1007/s13174-010-0014-7, URL http://www.springerlink.com/index/10.1007/s13174-010-0014-7
Amayri O, Bouguila N (2010) A Study of Spam Filtering using Support Vector Machines. Artificial Intelligence Review 34(1):73–108, DOI 10.1007/s10462-010-9166-x, URL http://link.springer.com/10.1007/s10462-010-9166-x
Andreolini M, Bulgarelli A, Colajanni M, Mazzoni F (2005) HoneySpam: Honeypots Fighting Spam at the Source. In: Proceedings of the Steps to Reducing Unwanted Traffic on the Internet Workshop, Cambridge, MA, pp 77–83
Androutsopoulos I, Koutsias J, Chandrinos KV, Paliouras G, Spyropoulos CD (2000a) An Evaluation of Naive Bayesian Anti-Spam Filtering. In: Proceedings of 11th European Conference on Machine Learning (ECML 2000), Barcelona, pp 9–17
Androutsopoulos I, Koutsias J, Chandrinos KV, Paliouras G, Spyropoulos CD (2000b) Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory based Approach. In: Proceedings of 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, Lyon, France, September 2000, pp 1–12
Androutsopoulos I, Koutsias J, Chandrinos KV, Spyropoulos CD (2000c) An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages. In: SIGIR ’00 Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp 160–167
Androutsopoulos I, Paliouras G, Michelakis E (2006) Learning to Filter Unsolicited Commercial E-Mail. Tech. rep., National Centre for Scientific Research Demokritos, Athens, Greece
Anti-Phishing Working Group (APWG) (2014) APWG Phishing Activity Trends Report, 2nd Quarter 2014. Tech. Rep. June, Anti-Phishing Working Group

(APWG), URL http://docs.apwg.org/reports/apwg_trends_report_q2_2014.pdf
Attar A, Rad RM, Atani RE (2011) A Survey of Image Spamming and Filtering Techniques. Artificial Intelligence Review 40(1):71–105, DOI 10.1007/s10462-011-9280-4, URL http://link.springer.com/10.1007/s10462-011-9280-4
Baayen H, Halteren HV, Neijt A, Tweedie F (2002) An Experiment in Authorship Attribution. In: Proceedings of JADT 2002: Sixth International Conference on Textual Data Statistical Analysis, pp 29–37
Basavaraju M, Prabhakar R (2010) A Novel Method of Spam Mail Detection using Text Based Clustering Approach. International Journal of Computer Applications 5(4):15–25
Basnet R, Mukkamala S, Sung AH (2008) Detection of Phishing Attacks: A Machine Learning Approach. In: Studies in Fuzziness and Soft Computing, pp 373–383
Bergholz A, Beer JD, Glahn S (2010) New Filtering Approaches for Phishing Email. Journal of Computer Security 18:7–35
Beverly R, Sollins K (2008) Exploiting Transport-Level Characteristics of Spam. In: 5th Conference on Email and Anti-Spam (CEAS), Mountain View, CA
Biggio B, Fumera G, Pillai I, Roli F (2006) A Survey and Experimental Evaluation of Image Spam Filtering Techniques. Pattern Recognition Letters 32(10):1436–1446
Biggio B, Corona I, Fumera G, Giacinto G, Roli F (2011) Bagging Classifiers for Fighting Poisoning Attacks in Adversarial Classification Tasks. In: Multiple Classifier Systems, Springer Berlin Heidelberg, pp 350–359
Blanco A, Ricket AM, Martín-Merino M (2007) Combining SVM Classifiers for Email Anti-spam Filtering. In: Computational and Ambient Intelligence, Springer Berlin Heidelberg, pp 903–910
Blanzieri E, Bryl A (2008) A Survey of Learning-Based Techniques of Email Spam Filtering. Journal Artificial Intelligence Review 29(1):63–92
Boykin PO, Roychowdhury VP (2005) Leveraging Social Networks to Fight Spam. Computer 38(4):61–68
Brien CO, Vogel C (2003) Spam Filters: Bayes vs. Chi-squared; Letters vs. Words. In: ISICT 03: Proceedings of the First International Symposium on Information and Communication Technologies, Dublin: Trinity College, September
Bringer ML, Chelmecki CA, Fujinoki H (2012) A Survey: Recent Advances and Future Trends in Honeypot Research. International Journal of Computer Network and Information Security 4(10):63–75, DOI 10.5815/ijcnis.2012.10.07, URL http://www.mecs-press.org/ijcnis/ijcnis-v4-n10/v4n10-7.html
Carpinter J, Hunt R (2006) Tightening the Net: A Review of Current and Next Generation Spam Filtering Tools. Computers and Security 25(8):566–578
Carreras X, Marquez L (2001) Boosting Trees for Anti-Spam Email Filtering p 7, URL http://arxiv.org/abs/cs/0109015, 0109015
Caruana G, Li M (2012) A Survey of Emerging Approaches to Spam Filtering. ACM Computing Surveys 44(2):1–27, DOI 10.1145/2089125.2089129, URL http://dl.acm.org/citation.cfm?doid=2089125.2089129
Chandrasekaran M, Narayanan K, Upadhyaya S (2006) Phishing E-mail Detection Based on Structural Properties. In: Proceedings of the NYS Cyber Security Conference, Albany, NY, pp 2–8
Chih-Chin Lai, Tsai MC (2004) An Empirical Performance Comparison of Machine Learning Methods for Spam E-mail Categorization. In: Fourth International Conference on Hybrid Intelligent Systems, HIS 2004, pp 0–4
Chirita PA, Diederich J, Nejdl W (2009) MailRank: Using Ranking for Spam Detection. In: Proceedings of the 14th ACM International Conference on Information and Knowledge Management, CIKM 2005, pp 373–380
Chou N, Ledesma R, Teraguchi Y, Boneh D, Mitchell JC (2004) Client-side Defense Against Web-based Identity Theft. In: Proc. 11th Annual Network and Distributed System Security Symposium (NDSS 04)
CISCO (2007) Botnets: The New Threat Landscape. Tech. rep., URL http://www.cisco.com/c/en/us/solutions/collateral/enterprise-networks/threat-control/networking_solutions_whitepaper0900aecd8072a537.pdf
CISCO (2014) Cisco 2014 Annual Security Report. Tech. rep., CISCO, URL http://www.cisco.com/web/offer/gist_ty2_asset/Cisco_2014_ASR.pdf
Clark KP (2008) A Survey of Content-based Spam Classifiers. pp 1–19
Cohen WW (1996) Learning Rules that Classify. In: Spring Symposium on Machine Learning in Information Access, pp 18–25
Cormack GV (2008) Email Spam Filtering: A Systematic Review. Foundations and Trends in Information Retrieval 1(4):335–455, DOI 10.1561/1500000006, URL http://www.nowpublishers.com/article/Details/INR-006
Cormack GV, Lynam TR (2007) On-line Supervised Spam Filter Evaluation. ACM Transactions on Information Systems (TOIS) 25(3)

Cranor LF, Lamacchia BA (1998) Spam! Communications of the ACM 41(8)
Crocker D (2009) Internet Mail Architecture - RFC 5598. Tech. rep., URL https://tools.ietf.org/html/rfc5598
Cyberoam (2014) Internet Threats Trend Report 2014. Tech. Rep. April, Cyberoam
Debarr D, Wechsler H (2009) Spam Detection using Clustering, Random Forests, and Active Learning. In: CEAS 2009 Sixth Conference on Email and Anti-Spam
Delany SJ, Bridge D (2006) Feature based and Feature free Textual CBR: a Comparison in Spam Filtering. In: Proceedings of the 17th Irish Conference on Artificial Intelligence and Cognitive Science (AICS ’06), pp 244–253
Delany SJ, Cunningham P, Tsymbal A, Coyle L (2005) A Case-based Technique for Tracking Concept Drift in Spam Filtering. Knowledge-Based Systems 18(4-5):187–195, DOI 10.1016/j.knosys.2004.10.002, URL http://linkinghub.elsevier.com/retrieve/pii/S0950705105000316
Diao Y, Lu H, Wu D (2003) A Comparative Study of Classification Based Personal E-mail Filtering. In: Knowledge Discovery and Data Mining. Current Issues and New Applications, pp 408–419
Dredze M, Schilit BN, Norvig P (2009) Suggesting Email View Filters for Triage and Search. In: Proceedings of International Joint Conference on Artificial Intelligence (IJCAI), pp 1414–1419
Drucker H, Wu D, Vapnik VN (1999) Support Vector Machines for Spam Categorization. IEEE Transactions on Neural Networks 10(5):1048–1054
Fawcett T (2004) “In vivo” Spam Filtering: A Challenge Problem for Data Mining. SIGKDD Explorations 5(2):140–148, URL http://arxiv.org/abs/cs/0405007, 0405007
Fdez-Riverola F, Iglesias E, Díaz F, Méndez J, Corchado J (2007a) Applying Lazy Learning Algorithms to Tackle Concept Drift in Spam Filtering. Expert Systems with Applications 33(1):36–48, DOI 10.1016/j.eswa.2006.04.011, URL http://linkinghub.elsevier.com/retrieve/pii/S0957417406001175
Fdez-Riverola F, Iglesias E, Díaz F, Méndez J, Corchado J (2007b) SpamHunting: An Instance-based Reasoning System for Spam Labelling and Filtering. Decision Support Systems 43(3):722–736, DOI 10.1016/j.dss.2006.11.012, URL http://linkinghub.elsevier.com/retrieve/pii/S0167923606002041
Fette I, Sadeh N, Tomasic A (2007) Learning to Detect Phishing Emails. In: Proceedings of the 16th International Conference on World Wide Web, New York, NY, USA, pp 649–656
Fumera G (2006) Spam Filtering Based On The Analysis Of Text Information Embedded Into Images. Journal of Machine Learning Research (special issue on Machine Learning in Computer Security) 7:2699–2720
Gansterer WN, Ecker GF (2008) On the Relationship Between Feature Selection and Classification Accuracy. Journal of Machine Learning Research 4:90–105
Garriss S, Kaminsky M, Freedman MJ, Karp B, Mazières D, Yu H (2006) RE: Reliable Email. In: NSDI’06 Proceedings of the 3rd Conference on Networked Systems Design & Implementation, pp 22–22
Golbeck J, Hendler J (2004) Reputation Network Analysis for Email Filtering Creating the Reputation Network. In: Proceedings of the First Conference on Email and Anti-Spam, Mountain View, California.
Gomez JC, Boiy E, Moens MF (2012) Highly Discriminative Statistical Features for Email Classification. Knowledge and Information Systems 31(1):23–53, DOI 10.1007/s10115-011-0403-7, URL http://link.springer.com/10.1007/s10115-011-0403-7
Goodman BJ, Cormack GV, Heckerman D (2007) Spam and the Ongoing Battle for the Inbox. Communications of the ACM 50(2):24–33
Graham P (2002a) A Plan for Spam. URL http://www.paulgraham.com/spam.html
Graham P (2002b) Will Filters Kill Spam? URL http://www.paulgraham.com/wfks.html
Graham P (2003) Better Bayesian Filtering. URL http://www.paulgraham.com/better.html
Graham-Cumming J (2004) How to Beat an Adaptive Spam Filter. In: The Spam Conference
Graham-Cumming J (2006) Does Bayesian Poisoning Exist? Virus Bulletin URL https://www.virusbtn.com/spambulletin/archive/2006/02/sb200602-poison.dkb?url=/archive/2006/02/sb200602-poison
Greenberg A (2010) The Most Common Words In Spam Email. URL http://www.forbes.com/sites/firewall/2010/03/17/the-most-common-words-in-spam-email/
Guerra PHC, Guedes D, Jr WM, Hoepers C, Chaves MHPC, Steding-jessen K (2010) Exploring the Spam Arms Race to Characterize Spam Evolution. In: CEAS 2010 - Seventh Collaboration, Electronic messaging, Anti-Abuse and Spam Conference, Redmond, Washington USA
Guyon I (2003) An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3:1157–1182
Guzella TS, Caminhas WM (2009) A Review of Machine Learning Approaches to Spam Filtering. Expert Systems with Applications 36(7):10,206–

10,222, DOI 10.1016/j.eswa.2009.02.037, URL http://linkinghub.elsevier.com/retrieve/pii/S095741740900181X
Gyongyi Z, Garcia-molina H (2005) Web Spam Taxonomy. In: 1st International Workshop on Adversarial Information Retrieval on the Web
Hao S, Syed NA, Feamster N, Gray AG, Krasser S (2009) Detecting Spammers with SNARE: Spatio-temporal Network-level Automatic Reputation Engine. In: Proceedings of 18th USENIX Security
Hayat MZ, Basiri J, Seyedhossein L, Shakery A (2010) Content-Based Concept Drift Detection for Email Spam Filtering. In: 5th International Symposium on Telecommunications (IST’2010), pp 531–536
He J, Thiesson B (2007) Asymmetric Gradient Boosting with Application to Spam Filtering. In: Fourth Conference on Email and Anti-Spam (CEAS), Mountain View, California, USA
Hershkop S (2006) Behavior-based Email Analysis with Application to Spam Detection. PhD thesis
Hershkop S, Stolfo SJ (2005) Identifying Spam Without Peeking at the Contents. ACM Crossroads, p 11
Hu Y, Guo C, Ngai E, Liu M, Chen S (2010) A Scalable Intelligent Non-content-based Spam-filtering Framework. Expert Systems with Applications 37(12):8557–8565, DOI 10.1016/j.eswa.2010.05.020, URL http://linkinghub.elsevier.com/retrieve/pii/S0957417410004318
IBM (2012) IBM X-Force 2012 Mid-Year Trend and Risk Report. Tech. rep., IBM, URL http://www.ibm.com/smarterplanet/global/files/ca__en_us__security__xorce_2012_midyear_trend_and_risk_report.pdf
IBM (2014) IBM X-Force Threat Intelligence Quarterly, 4Q 2014. Tech. Rep. November, IBM, URL http://public.dhe.ibm.com/common/ssi/ecm/wg/en/wgl03062usen/WGL03062USEN.PDF
Irwin B, Friedman B (2008) Spam Construction Trends. In: Information Security for South Africa (ISSA), pp 1–12
Isacenkova J, Balzarotti D (2011) Measurement and Evaluation of a Real World Deployment of a Challenge-Response Spam Filter. In: Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement Conference (IMC ’11), pp 413–426
John JP, Moshchuk A, Gribble SD, Krishnamurthy A (2009) Studying Spamming Botnets Using Botlab. In: USENIX Symposium on Networked Systems Design and Implementation (NSDI)
Jonathan B Postel (1982) Simple Mail Transfer Protocol - RFC 821. Tech. rep., URL https://www.ietf.org/rfc/rfc0821.txt
Jorgensen Z, Zhou Y, Inge M (2008) A Multiple Instance Learning Strategy for Combating Good Word Attacks on Spam Filters. Journal of Machine Learning Research 8:1115–1146
Kanaris I, Kanaris K, Houvardas I, Stamatatos E (2006) Words vs. Character n-grams for Anti-spam Filtering. International Journal on Artificial Intelligence Tools XX(X):1–20
Kaspersky (2014) Kaspersky Security Bulletin 2014. Predictions 2015. Tech. rep.
Katakis I, Tsoumakas G, Vlahavas I (2007) Email Mining: Emerging Techniques for Email Management. In: Vakali A, Pallis G (eds) Web Data Management Practices: Emerging Techniques and Technologies, Idea Group Publishing, USA, chap 10
Kiritchenko S, Matwin S, Abu-hakima S (2004) Email Classification with Temporal Features. In: Intelligent Information Processing and Web Mining, Springer Berlin Heidelberg, pp 523–533
Kolari P, Java A, Finin T, Oates T, Joshi A (2006) Detecting Spam Blogs: A Machine Learning Approach. In: AAAI’06 Proceedings of the 21st National Conference on Artificial Intelligence, pp 1351–1356
Koprinska I, Poon J, Clark J, Chan J (2007) Learning to Classify E-mail. Information Sciences 177(10):2167–2187, DOI 10.1016/j.ins.2006.12.005, URL http://linkinghub.elsevier.com/retrieve/pii/S0020025506003707
Lai CC (2007) An Empirical Study of Three Machine Learning Methods for Spam Filtering. Knowledge-Based Systems 20(3):249–254, DOI 10.1016/j.knosys.2006.05.016, URL http://linkinghub.elsevier.com/retrieve/pii/S0950705106001390
Lee K, Caverlee J, Webb S (2010a) Uncovering Social Spammers: Social Honeypots + Machine Learning. In: Proc. of 33rd Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, New York, NY, USA, pp 435–442
Lee SM, Kim DS, Kim JH, Park JS (2010b) Spam Detection Using Feature Selection and Parameters Optimization. 2010 International Conference on Complex, Intelligent and Software Intensive Systems (i):883–888, DOI 10.1109/CISIS.2010.116, URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5447486
Leiba B, Ossher J, Segal R, Wegman M (2005) SMTP Path Analysis. In: Proceedings of Second Conference on Email and Anti-Spam, CEAS ’2005
Li F, Hsieh Mh, Gburzynski P (2007) The Community Behavior of Spammers. URL http://web.media.mit.edu/~fulu/ClusteringSpammers.pdf
Liu C, Stamm S (2007) Fighting Unicode-Obfuscated Spam

Lowd D, Meek C (2005) Good Word Attacks on Statistical Spam Filters. In: Proceedings of the Second Conference on Email and Anti-Spam (CEAS)
Ludl C, McAllister S, Kirda E, Kruegel C (2007) On the Effectiveness of Techniques to Detect Phishing Sites. In: DIMVA 07: Proceedings of the 4th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, Springer-Verlag, Berlin, Heidelberg, pp 20–39
Ludlow M (2002) Just 150 spammers Blamed for E-mail Woe. The Sunday Times
M Sahami, S Dumais, D Heckerman, E Horvitz (1998) A Bayesian Approach to Filtering Junk E-Mail. In: 15th National Conference on Artificial Intelligence, Madison, WI, USA, Cohen, pp 55–62
McAfee (2012) Snowshoe Spamming Emerges as Threat to Email Security. URL http://www.mcafee.com/in/security-awareness/articles/snowshoe-spamming-biggest-problem.aspx
McMillan R (2008) 100 E-mail Bouncebacks? You’ve Been Backscattered. URL http://www.pcworld.com/article/145449/article.html
Méndez JR, Díaz F, Iglesias EL, Corchado JM (2006) A Comparative Performance Study of Feature Selection Methods for the Anti-spam Filtering Domain. In: Advances in Data Mining. Applications in Medicine, Web Mining, Marketing, Image and Signal Mining, Springer Berlin Heidelberg, pp 106–120
Metsis V, Androutsopoulos I, Paliouras G (2006) Spam Filtering with Naive Bayes Which Naive Bayes? In: Proceedings of the 3rd International Conference on E-mail and Anti-Spam, Mountain View, CA, USA, pp 1–5
Michelakis E, Androutsopoulos I, Paliouras G, Sakkis G (2004) Filtron: A Learning-Based Anti-Spam Filter. In: Proceedings of the 1st Conference on E-mail and Anti-Spam (CEAS)
Nattakant U (2009) Review of Browser Extensions, a Man-in-the-Browser Phishing Techniques Targeting Bank Customers. In: Proceedings of the 7th Australian Information Security Management Conference, pp 4–12
Neumayer R (2006) Clustering Based Ensemble Classification for Spam Filtering. In: Proceedings of the 6th Workshop on Data Analysis, Elfa Academic Press, pp 11–22
P Resnick (2001) Internet Message Format - RFC 2822. Tech. Rep. April 2001, URL https://tools.ietf.org/html/rfc2822
Pantel P, Lin D (1998) SpamCop: A Spam Classification & Organization Program. In: Learning from Text Categorization Papers from the AAAI Workshop AAAI Technical Report WS-98-05, Madison, Wisconsin, pp 95–98
Perera KS, Neupane B, Faisal MA, Aung Z, Woon WL (2013) A Novel Ensemble Learning-Based Approach for Click Fraud Detection in Mobile Advertising. In: Mining Intelligence and Knowledge Exploration, Springer International Publishing, pp 370–382
Pu C, Webb S (2006) Observed Trends in Spam Construction Techniques: A Case Study of Spam Evolution. In: Proceedings of third conference on e-mail and anti-spam (CEAS), vol 6, pp 0–8
Radicati (2016) Email Statistics Report, 2012-2016 - Executive Summary. Tech. Rep. 650, Radicati
Ramachandran A, Feamster N (2006) Understanding the Network-Level Behavior of Spammers. In: Proceedings of ACM SIGCOMM
Rios G, Zha H (2004) Exploring Support Vector Machines and Random Forests for Spam Detection. In: Conference on e-mail and anti-spam (CEAS), pp 5–10
Robinson G (2003) A Statistical Approach to the Spam Problem. Linux Journal (March 2003):3, URL http://dl.acm.org/citation.cfm?id=636750.636753
Sakkis G, Androutsopoulos I, Paliouras G, Karkaletsis V (2001) Stacking Classifiers for Anti-spam Filtering of E-mail. In: Empirical methods in Natural Language Processing, pp 44–50
Sanz EP (2008) E-mail Spam Filtering. Advances in Computers 74:45–109
Sasaki M, Shinnou H (2005) Spam Detection Using Text Clustering. In: International Conference on Cyberworlds, Ml
Sculley D, Wachman GM (2007) Relaxed Online SVMs for Spam Filtering. In: SIGIR ’07 Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp 415–422
Seewald AK (2007) An Evaluation of Naive Bayes Variants in Content-Based Learning for Spam Filtering. Intelligent Data Analysis 11(5):497–524
Sheu Jj (2007) An Efficient Two-phase Spam Filtering Method Based on E-mails Categorization. Computers & Security 26(1):381–390
Shi L, Wang Q, Ma X, Weng M, Qiao H (2012) Spam Email Classification Using Decision Tree Ensemble. Journal of Computational Information Systems 3(February):949–956
Shi W, Xie M (2013) A Reputation-based Collaborative Approach for Spam Filtering. In: AASRI Procedia, Elsevier B.V., vol 5, pp 220–227, DOI 10.1016/j.aasri.2013.10.082, URL http://linkinghub.elsevier.com/retrieve/pii/S2212671613000838
Siefkes C, Assis F, Chhabra S, Yerazunis WS (2004) Combining Winnow and Orthogonal Sparse Bigrams

for Incremental Spam Filtering. In: European Conference on Machine Learning (ECML), pp 410–421
Siponen M, Stucke C (2006) Effective Anti-spam Strategies in Companies: An International Study. In: Proceedings of the 39th Hawaii International Conference on System Sciences (HICSS), vol 06, pp 1–10
Song Y, Kocz A, Giles CL (2009) Better Naive Bayes Classification for High-precision Spam Detection. Software Practice and Experience (April):1003–1024, DOI 10.1002/spe
Sophos (2013) Security Threat Report 2013. Tech. rep., Sophos
Sophos (2014) Security Threat Report - 2014. Tech. rep., Sophos, URL http://www.sophos.com/en-us/threat-center/medialibrary/PDFs/other/sophos-security-threat-report-2014.pdf
Soranamageswari M, Meena C (2010) Statistical Feature Extraction for Classification of Image Spam Using Artificial Neural Networks. In: 2010 Second International Conference on Machine Learning and Computing, IEEE, pp 101–105, DOI 10.1109/ICMLC.2010.72, URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5460761
Stepp M (2005) PhishHook: A Tool to Detect and Prevent Phishing Attacks. In: DIMACS Workshop on Theft in E-Commerce: Content, Identity, and Service
Stern H, Mason J, Shepherd M (2004) A Linguistics-based Attack on Personalised Statistical E-mail Classifiers. Tech. rep., Dalhousie Univ, URL https://www.cs.dal.ca/sites/default/files/technical_reports/CS-2004-06.pdf
Symantec (2014) Internet Security Threat Report. Tech. Rep. April, Symantec
Taylor B, Fingal D, Aberdeen D (2007) The War Against Spam: A report from the Front Line. In: Workshop on Machine Learning in Adversarial Environments for Computer Security (NIPS 2007), pp 1–3
Toolan F, Carthy J (2010) Feature Selection for Spam and Phishing Detection. In: eCrime Researchers Summit (eCrime), 2010, pp 1–12
Tretyakov K (2004) Machine Learning Techniques in Spam Filtering. In: Data Mining Problem-oriented Seminar, MTAT, May, pp 60–79
Tseng CY, Chen MS (2009) Incremental SVM Model for Spam Detection on Dynamic Email Social Networks. In: 2009 International Conference on Computational Science and Engineering, IEEE, pp 128–135, DOI 10.1109/CSE.2009.260, URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5284281
Tsymbal A (2004) The Problem of Concept Drift: Definitions and Related Work (Technical Report TCD-CS-2004-15). Tech. rep., Computer Science Department, Trinity College, Dublin, Ireland
Tsymbal A, Pechenizkiy M, Cunningham P (2008) Dynamic Integration of Classifiers for Handling Concept Drift. Information Fusion 9(1):56–68
Wang CC, Chen SY (2007) Using Header Session Messages to Anti-spamming. Computers & Security 26(5):381–390, DOI 10.1016/j.cose.2006.12.012, URL http://linkinghub.elsevier.com/retrieve/pii/S0167404807000065
Wang D, Irani D, Pu C (2013) A Study on Evolution of Email Spam Over Fifteen Years. In: Proceedings of the 9th IEEE International Conference on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom), ICST, Austin, TX, USA, DOI 10.4108/icst.collaboratecom.2013.254082, URL http://eudl.eu/doi/10.4108/icst.collaboratecom.2013.254082
Wang S, Wang B, Lang H, Cheng X (2005) Using Non-Textual Information to Improve Spam Filtering Performance. In: CAS-ICT at Text REtrieval Conference (TREC) 2005 SPAM Track
Whissell JS, Clarke CLA (2011) Clustering for Semi-Supervised Spam Filtering. In: Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference (CEAS ’11), pp 125–134
Wittel GL, Wu SF (2004) On Attacking Statistical Spam Filters. In: CEAS: First Conference on Email and Anti-Spam, Mountain View, CA
Woitaszek M, Shaaban M (2003) Identifying Junk Electronic Mail in Microsoft Outlook with a Support Vector Machine. In: Proceedings of the 2003 symposium on applications and the internet, SAINT, pp 166–169
Wu Ch (2009) Behavior-based Spam Detection using a Hybrid Method of Rule-based Techniques and Neural Networks. Expert Systems With Applications 36(3):4321–4330, DOI 10.1016/j.eswa.2008.03.002, URL http://dx.doi.org/10.1016/j.eswa.2008.03.002
Wu CH, Tsai CH (2008) Robust Classification for Spam Filtering by Back-propagation Neural Networks using Behavior-based Features. Applied Intelligence 31(2):107–121, DOI 10.1007/s10489-008-0116-0, URL http://link.springer.com/10.1007/s10489-008-0116-0
Wu X, Kumar V, Ross Quinlan J, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Yu PS, Zhou ZH, Steinbach M, Hand DJ, Steinberg D (2007) Top 10 Algorithms in Data Mining, vol 14. DOI 10.1007/s10115-007-0114-2, URL http://link.springer.com/10.1007/s10115-007-0114-2

Xie Y, Yu F, Achan K, Panigrahy R, Hulten G, Osipkov I, Communication CC, Network N (2008) Spamming Botnets: Signatures and Characteristics. In: Proceedings of ACM SIGCOMM08, Seattle, WA
Yang J, Liu Y, Liu Z, Zhu X, Zhang X (2011) A New Feature Selection Algorithm based on Binomial Hypothesis Testing for Spam Filtering. Knowledge-Based Systems 24(6):904–914, DOI 10.1016/j.knosys.2011.04.006, URL http://linkinghub.elsevier.com/retrieve/pii/S0950705111000724
Yang J, Liu Y, Zhu X, Liu Z, Zhang X (2012) A New Feature Selection based on Comprehensive Measurement both in Inter-category and Intra-category for Text Categorization. Information Processing & Management 48(4):741–754, DOI 10.1016/j.ipm.2011.12.005, URL http://linkinghub.elsevier.com/retrieve/pii/S030645731100118X
Yang Y, Pedersen JO (1997) A Comparative Study on Feature Selection in Text Categorization. In: ICML ’97 Proceedings of the Fourteenth International Conference on Machine Learning, pp 412–420
Yang Y, Yoo S, Lin F, Moon IC (2010) Personalized Email Prioritization Based on Content and Social Network Analysis. IEEE Intelligent Systems 25(4):12–18
Yeh Cy, Wu CH, Doong SH (2005) Effective Spam Classification based on Meta-Heuristics. In: Proceedings of IEEE International Conference on Systems, Man and Cybernetics, pp 3872–3877
Yerazunis WS (2003) Sparse Binary Polynomial Hashing and the CRM114 Discriminator Rough Guide to this Talk. In: MIT Spam Conference
Yerazunis WS (2004) The Spam-Filtering Accuracy Plateau at 99.9 percent Accuracy and How to Get Past It. In: MIT Spam Conference
Youn S, Mcleod D (2006) A Comparative Study for Email Classification. In: Proceedings of International Joint Conferences on Computer, Information, System Sciences, and Engineering (CISSE06), Bridgeport, CT
Yu B, Xu Zb (2008) A Comparative Study for Content-based Dynamic Spam Classification using Four Machine Learning Algorithms. Knowledge-Based Systems 21(4):355–362, DOI 10.1016/j.knosys.2008.01.001, URL http://linkinghub.elsevier.com/retrieve/pii/S0950705108000026
Zhang L, Zhu J, Yao T (2004) An Evaluation of Statistical Spam Filtering Techniques Spam Filtering as Text Categorization. ACM Transactions on Asian Language Information Processing (TALIP) 3(4):243–269
Zhu Y, Tan Y (2011) A Local-Concentration-Based Feature Extraction Approach for Spam Filtering. IEEE Transactions on Information Forensics and Security 6(2):486–497
Zhuang L, Dunagan J, Simon DR, Wang HJ, Tygar JD (2008) Characterizing Botnets from Email Spam Records. In: LEET 08: First USENIX Workshop on Large-Scale Exploits and Emergent Threat
