E-Mail Spam Filtering: A Review of Techniques and Trends: January 2018
All content following this page was uploaded by Shyamanta M Hazarika on 02 May 2018.
Alexy Bhowmick · Shyamanta M. Hazarika
School of Engineering, Tezpur University, Tezpur, Assam, India.
E-mail: alexyb@tezu.ernet.in

Abstract We present a comprehensive review of the most effective content-based e-mail spam filtering techniques. We focus primarily on Machine Learning-based spam filters and their variants, and report a broad survey of the relevant ideas, efforts, effectiveness, and current progress. The initial exposition of the background examines the basics of e-mail spam filtering, the evolving nature of spam, spammers playing cat-and-mouse with e-mail service providers (ESPs), and the Machine Learning front in fighting spam. We conclude by measuring the impact of Machine Learning-based filters and explore the promising offshoots of the latest developments.

Keywords E-mail · False positive · Image spam · Machine learning · Spam · Spam filtering.

1 Introduction

Electronic-mail (abbreviated as e-mail) is a fast, effective and inexpensive method of exchanging messages over the Internet. Whether it is a personal message from a family member, a company-wide message from the boss, researchers across continents sharing recent findings, or astronauts staying in touch with their family (via e-mail uplinks or IP phones), e-mail is a preferred means of communication. Used worldwide by 2.3 billion users at the time of writing the article, e-mail usage is projected to increase up to 4.3 billion accounts by the year-end 2016 [Radicati, 2016]. But the increasing dependence on e-mail has induced the emergence of many problems caused by ‘illegitimate’ e-mails, i.e. spam. The Text Retrieval Conference (TREC) defines ‘spam’ as an unsolicited, unwanted e-mail that was sent indiscriminately [Cormack, 2008]. Spam e-mails are unsolicited, un-ratified and usually mass mailed. Spam, being a carrier of malware, causes the proliferation of unsolicited advertisements, fraud schemes, phishing messages, explicit content, promotions of causes, etc. On an organizational front, spam effects include: i) annoyance to individual users, ii) less reliable e-mails, iii) loss of work productivity, iv) misuse of network bandwidth, v) wastage of file server storage space and computational power, vi) spread of viruses, worms, and Trojan horses, and vii) financial losses through phishing, Denial of Service (DoS), directory harvesting attacks, etc. [Siponen and Stucke, 2006].

Over the past couple of decades e-mail spam volume has increased exponentially, and spam is no longer just an annoyance but a security threat, as it continues to evolve in its potential to do serious damage to individuals, businesses and economies. The fact that e-mail is a very cheap means of reaching millions of potential customers serves as a strong motivation for amateur advertisers and direct marketers [Cranor and Lamacchia, 1998]. For example, one of the favorite spam topics is the ‘penny stock’ spam, or the pump and dump schemes that take place over the Internet platform. Fraudsters (spammers) purchase large quantities of ‘penny stocks’, i.e. stocks of small, thinly traded companies, through compromised brokerage accounts and promote them via message boards or a broad e-mail campaign, pointing to the transient increase in share value. Even if a fraction of the recipients are fooled into buying the stocks, the spammers make a huge profit. Unwitting investors seeking higher gains believe the hype and purchase the
2 Alexy Bhowmick, Shyamanta M. Hazarika
most widely implemented protocols for the Mail User Agent (MUA) and are basically used to receive messages. A Message Transfer Agent (MTA) receives mails from a sender MUA or some other MTA and then determines the appropriate route for the mail [Katakis et al, 2007]. The recipient's MTA delivers the incoming mail to the incoming mail server, the Mail Delivery Agent (MDA), which is basically a POP/IMAP server. MUAs (e.g. Mozilla Thunderbird, Microsoft Outlook, etc.) are e-mail clients and help the user to read and write e-mails. Spam filters can be deployed at strategic places in both clients and servers. Many Internet Service Providers (ISPs) and organizations deploy spam filters at the e-mail server level, the preferred places being the gateways, mail routers, etc. They can also be deployed in clients, where they can be installed at proxies or as plug-ins, as in [Irwin and Friedman, 2008]. Some spam filters (e.g. SpamBayes) can be deployed at both server and client levels.

1.2 Structure of an E-mail

An e-mail comprises two elements: the body and the header. The e-mail body comprises unstructured data such as text, HTML markup, multimedia objects and attachments. The header comprises trace information and structured fields that are part of the message content. The Simple Mail Transfer Protocol (SMTP) [Jonathan B. Postel, 1982] defines the e-mail header section to contain fields such as the subject, sender's name, e-mail ID, sending date, routing information, timestamp, etc. for recipient information and successful delivery. Each attribute (field) in the header has a name and a specific meaning -
– Received: Contains transit-related information of e-mail servers, IP addresses, dates, etc.
– From: Sender’s name; e-mail ID. “Name” <e-mail@example.com>
– To: Recipient’s name; e-mail ID. “Name” <e-mail@example.com>
– Return-Path: Encloses an optional address specification to be used if an error is encountered (bounce).
– Message-ID: A single unique message identifier designated by the mail system.
– X-Mailer: The mail software used to create/send the message.
– Subject: String placed by the sender that identifies the theme of the message.
– Content-Type: Format of content (character set, etc.), specified by MIME (Multipurpose Internet Mail Extensions).
Each e-mail message also comprises the transit-handling envelope [Crocker, 2009] that is hidden from e-mail users. First the envelope sender address is sent, followed by one or more envelope recipient addresses, and finally the actual message is sent. The e-mail servers actually use the envelope address (not the message header address) to deliver the e-mail to the correct recipient. The final recipient sees only the e-mail header and body. The envelope address is one of the e-mail features that is very often abused by spammers.

2 Characterizing Spam Evolution

A couple of decades earlier spam e-mail content was mainly textual. Therefore, spam filters analyzed only the e-mail body and header to distinguish ham (legitimate e-mails) from spam e-mails. Today however, amateur advertisers and opportunists harvest addresses from chat rooms, web pages, newsgroup archives, service provider directories, etc. and send junk e-mail blindly to millions without much cost [Androutsopoulos et al, 2006]. Anti-spam software companies and research groups that have been working on spam filtering for quite some time have tasted limited success, mostly because spam filtering is an adversarial classification task. In such tasks, a malicious adversary ‘poisons’ the training data with carefully crafted attack techniques in order to mislead a classifier [Jorgensen et al, 2008]. To deliver spam e-mail to a huge number of recipients, spammers often resort to the use of bulk mailing software or e-mail harvesters [Blanzieri and Bryl, 2008].

Spam evolution has been briefly discussed in scientific literature [Carpinter and Hunt, 2006] [Guzella and Caminhas, 2009] [Almeida and Yamakami, 2012]. One reason why spam is difficult to filter is its dynamic nature. The characteristics (e.g. topics, frequent terms, etc.) of spam e-mail vary rapidly over time as spammers always seek to invent new strategies to bypass spam filters. These strategies include word obfuscation, image spam, sending e-mail spam from hijacked computers, etc. A proper understanding of the nature and evolution of spam can help much in the development of proper countermeasures. Some of the evasion techniques and major trends in spam causes and characteristics seen over the years are discussed below:

2.1 Word Obfuscation

Words like ‘sex’, ‘free’, ‘congratulations’ are good indicators of spam and have large (‘spammy’) weights. Initial spam filters based on heuristic filtering could easily detect and filter spam e-mails based on the presence of such obvious words. Figure 2 illustrates a word cloud of common words in spam e-mail [Greenberg, 2010]; the
larger a word appears, the more often it has been found to occur in e-mail spam. Spammers adapted quickly by making sure such obvious words are not encountered verbatim in their messages. To defeat filters, they resorted to simple obfuscation techniques like breaking the word into multiple pieces, as -
– f-r-e-e (embedding special characters)
– fr<!--xx-->ee (using HTML comments)
– <a href=’mailto:%66ree’>free</a> (with character-entity encoding)
– o frexe (encoded with HTML ASCII codes)
When seen by any web user, all the above samples look the same as “free”, but they thwart simple word/phrase filtering and escape the filter rules. The effectiveness of filter re-training, however, caused spammers to abandon one technique and migrate to newer obfuscation techniques. HTML-based obfuscation techniques are discussed at length in the study by [Pu and Webb, 2006]. Spammers also resorted to the use of innocuous words to obfuscate the e-mail message content in order to confuse or circumvent spam filters. In general, there are many ways to obscure the e-mail content: misplaced spaces, purposeful misspellings, embedded special characters (letter substitution), Unicode letter transliteration [Liu and Stamm, 2007], HTML redrawing, etc. Tokenization attacks are a similar spamming technique more associated with the preprocessing stage in spam filtering. In tokenization the spammer works to defeat the feature selection process by splitting and modifying the crucial message features. Examples include introducing spaces, special symbols or asterisks in words, or HTML, JavaScript and CSS layout tricks. A classic example of evading the recognition of the word ‘VIAGRA’ by the spam filter would be ‘V-I-A-G-R-A’.

2.2 Bayesian Poisoning Attacks

A usual criticism of statistical spam filters (e.g. SpamBayes, DSPAM, POPFile) is that they are susceptible to ‘poisoning’ by interjection of random words into the spam messages [Fawcett, 2004], [Graham-Cumming, 2006]. Bayesian poisoning is a kind of statistical attack in which spammers use carefully crafted e-mails to attack the heart of a Bayesian filter and thus degrade its effectiveness. The spammers add random or carefully selected legitimate-seeming words in order to confuse the spam filter and cause it to believe an incoming spam e-mail is not spam (a type II error). Spammers can get these common English words or ham phrases from sources like Reuters news articles, written and spoken English, and USENET messages. These strong statistical attacks have an incidental consequence too: a type I error, or simply a higher false positive rate. The reason is that when the user trains the spam filter with the poisoned training data, the spam filter ‘learns’ such random words as being good evidence of spam [Sanz, 2008]. Paul Graham [Graham, 2002b] however played down the effectiveness of such poisoning techniques, arguing that to outweigh the statistical significance of even one incriminating word such as “viagra”,
Machine Learning for E-mail Spam Filtering: Review, Techniques and Trends 5
spammers would need many innocent words (e.g. names of one's friends and family, terms used at work, etc.) which are unique for each recipient, and spammers have no way of figuring them out. However, evidence suggests Bayesian poisoning is real and cannot be dismissed [Biggio et al, 2011].

Graham-Cumming [Graham-Cumming, 2004] [Graham-Cumming, 2006] identified two types of possible attacks on Bayesian filters: passive (where, in the absence of feedback, the spammer can at best make educated guesses) and active (where the spammer discovers an effective wordlist after getting feedback). [Lowd and Meek, 2005] investigated ‘good word attacks’, where a spammer appends words indicative of legitimate e-mail, and found Naive Bayes extremely vulnerable to both active and passive attacks. Their results showed that frequent filter re-training could mitigate the effect of these attacks. [Wittel and Wu, 2004] explored a simple passive attack of poisoning with random words (a dictionary attack) and found it ineffective against CRM114 (the Controllable Regex Mutilator, an open-source spam filtering device), but effective against SpamBayes (a popular open-source spam filtering tool with 700,000 downloads, based on techniques laid out by Paul Graham). A smarter passive attack with common or ‘hammy’ words (common word or focused attack) saw SpamBayes perform even worse, but CRM114 remained very resistant. [Stern et al, 2004] showed that injecting common words from the English language decreased the performance of SpamBayes. Published research indicates that Bayesian poisoning is real, the number of published attack methods indicates that it cannot be dismissed, and further investigation of the poisoning of statistical spam filters is a worthwhile research task.

2.3 Backscatter Spam

When an e-mail is sent, the sender is normally informed if the e-mail could not be delivered or the delivery was delayed for some reason. E-mail servers normally send a bounce message notifying the sender of delivery problems. Such a message is termed a Delivery Status Notification (DSN). Mostly, DSNs are welcome to the sender and they are generally sent to the envelope sender address. Backscatter occurs when DSNs are sent to senders whose addresses are forged in the message envelope by spammers. In other words, backscatter consists of delivery notifications from another server, rejecting an e-mail made to appear as if it had been mailed from the recipient's own account [Cormack and Lynam, 2007]. These mails are then delivered unsolicited in bulk quantities to a lot of recipients. Hence, backscatter qualifies as unsolicited bulk e-mail and is spam. Misdirected bounces from mail servers, misdirected “please confirm your subscription” requests from mailing lists, “out of office” vacation autoreplies and auto-responders, challenge requests from Challenge/Response systems, etc. are the major varieties of backscatter. Backscatter, also called ‘collateral spam’, is a direct consequence of spam.

[Cormack and Lynam, 2007] experimented with six open-source filters and a test set of 49,086 messages, with backscatter representing a mere 1% of the total spam in the test set. It was found that content-based spam filters could filter 98% of the spam, but backscatter was the most difficult to classify, with nearly all the backscatter messages being misclassified. Backscatter is a problem that is hard to deal with, and though spammers may be blamed for it, it simply exists because mail servers are configured to bounce messages back to fake addresses rather than just reject such spam immediately [McMillan, 2008]. Servers that generate e-mail backscatter can end up on various DNS-based Blacklists (DNSBLs). Improperly configured e-mail servers give rise to ‘open relays’, which contribute to the problem of backscatter. Open relay servers can also get listed in various DNSBLs.

2.4 Image Spam

Text-based spam filters are designed only to analyze different components of an e-mail (sender’s address, header, body, attachments) and detect specific spam characteristics. A new type of spam called image-based spam or image spam is rapidly spreading. It involves textual spam content embedded into images that are attached to e-mails. OCR (Optical Character Recognition)-based modules are effective to a limited extent against image spam [Biggio et al, 2006] [Fumera, 2006]. But often the textual content is obfuscated by spammers to evade OCR tools. Till 2010, the upsurge of spam e-mails meant that roughly up to 85% of all e-mail spam was image spam [Wu and Tsai, 2008].

SpamAssassin, a widely used commercial and open-source spam filter, provides several OCR plug-ins that can be used to detect image spam, e.g. OCR Plugin (maintained at http://wiki.apache.org/spamassassin/OcrPlugin), Fuzzy OCR Plugin (available but no longer maintained), and Bayes OCR Plugin (a beta version is available at http://pralab.diee.unica.it/en/BayesOCR). It has been established from current literature that applying modern classification approaches to the generated
text from image spam is very efficient. Later, signatures were also generated to easily detect and filter already known image spam. The spam filter database by mid-2012 contained more than 40 million relevant spam signatures [IBM, 2012]. In order to avoid signature-based techniques, spammers switched tactics by making arbitrary alterations to a specified template image. They began employing obfuscation, similar to the approach usually applied in web forums, to outsmart Optical Character Recognition (OCR) tools. Lately, Pattern Recognition techniques and Computer Vision are playing a significant part in the filtering of multimedia data. However, the solutions achieved so far have shortcomings, and their efficiency is yet to be systematically investigated [Biggio et al, 2006].

Image-based filtering involves extraction of relevant features from the image and classification by state-of-the-art classifiers. Image-based spam detection is an example of classification of multimedia data. A number of researchers have devised approaches based on Pattern Recognition and Computer Vision to address different forms of image spam. In general they can be grouped into two broad categories: a) OCR-based techniques and b) techniques based on low-level image features. The use of OCR tools to extract text embedded into images, and processing it using modern text categorization techniques, was thoroughly investigated by [Fumera, 2006]. But OCR tools have proven to be computationally expensive and not accurate enough in adversarial situations [Goodman et al, 2007] [Attar et al, 2011]. [Biggio et al, 2006] surveyed and categorized the major techniques which have been suggested as image spam solutions. Spilling of image spam onto social networks like Twitter or Facebook has become widespread. Extraction of features for image-based spam filters is further discussed in Sec 3. A detailed and recent review involving definitions, spam tricks, a complete classification of image spam filtering techniques, and datasets may be found in [Wu and Tsai, 2008].

2.5 Botnet Spam

At a time when blacklists had almost put the spammers out of business and diminished their profits, some enterprising spammers joined hands with virus and exploit code writers to get access to compromised machines on the Internet known as ‘bots’ or ‘zombies’. The term botnet applies to an army of machines that are compromised and controlled by a single ‘botmaster’. A bot, when subverted (e.g. by a virus/Trojan infection or by specific bot software), can be used to send out spam or malware, harvest password and login information for identity theft and fraud, re-route users to spoofed websites, or even recruit new bots, and so on. Botnets on the other hand constitute a major threat to the Internet infrastructure, as they have the capability to mount crippling denial of service (DoS) attacks on servers, generate click-fraud [Perera et al, 2013], send out a flood of spam and backscatter [Xie et al, 2008], facilitate phishing and pump-and-dump schemes, form a computational grid to break weak passwords or obfuscate the operator's point of origin, etc. Botnets run at the global level, outside the range of national boundaries. According to the public tracker Shadowserver (https://www.shadowserver.org/wiki/), at least one million zombie machines or bots are believed to be active and the number is still growing.

Identifying and blacklisting each and every bot is challenging, both because a botnet attack is momentary and because a single bot transmits only a small volume of spam e-mails to avoid detection. On the other hand, spammers are using large botnets to send spam, thus creating an extremely large number of IP addresses to be blacklisted. Grum, a sneaky, kernel-mode rootkit, was of notable interest to researchers. It was a relatively small botnet with only 600,000 members. Yet it was responsible for almost 25 percent, or 40 billion spam e-mails a day, before it was finally taken down. Identifying botnets is a new challenge for the anti-spam industry, and tracking spammers, bringing them to justice, and pulling down botnet servers becomes an international undertaking. July 18, 2012 saw the takedown of the Grum botnet [Sophos, 2013]. Recently, as a repercussion of the bombing incident during the Boston Marathon on April 15, 2013, botnet spam related to the bombing was found to have constituted 40 percent of all spam messages transmitted globally on subsequent days [CISCO, 2014].

According to a CISCO report [CISCO, 2007], botnets are the primary security threat on the Internet today. Botnets are hard to detect because of their dynamic nature and their adaptability in evading the common security defenses. Botnets have been studied thoroughly, particularly in the context of spam and phishing [Xie et al, 2008], [John et al, 2009] and [Zhuang et al, 2008]. Botnets are emerging as the most severe threat against cyber-security, as they provide a distributed platform for several unlawful activities like distributed denial of service (DDoS) attacks, malware dissemination, phishing, scanning and click fraud. Because botnets attack from multiple fronts, there is no single technology that can provide protection from them.
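The IP blacklisting discussed above is commonly exposed through DNS-based blacklists (DNSBLs): a mail server asks whether a connecting IPv4 address is listed by reversing its octets, appending the blacklist zone, and resolving the resulting name. The following is a minimal sketch of that lookup, not a description of any particular filter. The zone `dnsbl.example.org` is a hypothetical placeholder, and the helper names `dnsbl_query_name` and `is_listed` are introduced here purely for illustration.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name used to ask a DNSBL zone about an IPv4 address."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the zone resolves the query name (address is listed).

    Needs network access; the zone passed in is an assumption of the caller.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False  # no such name (not listed) or the lookup itself failed


if __name__ == "__main__":
    # Hypothetical zone; substitute a real DNSBL zone to perform a live check.
    print(dnsbl_query_name("192.0.2.10", "dnsbl.example.org"))
```

Because each bot sends only a small volume of mail from its own address, a list like this has to track a very large and fast-changing set of IPs, which is the scaling problem the section describes.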
2.6 Social Engineering - Phishing

Spammers are increasingly adopting social engineering techniques in their spam campaigns. Patient and committed attackers perform extensive research, gain a sophisticated understanding of the needs and motivation of recipients, and then contact them with highly believable communications (e.g. e-mails or social networking messages) which may reflect knowledge of the individuals’ work activities, colleagues, friends, and family. Phishing is an illegal attempt that exploits both social engineering and technical deception to acquire sensitive confidential data (e.g. social security number, e-mail address, passwords, etc.) and financial account credentials [Robinson, 2003] [Bergholz et al, 2010]. Phishing involves spam e-mails disguised as legitimate, with a subject or message designed to trick the victims into revealing confidential information. In deceptive phishing, e-mail notifications appearing to come from credit card companies, security agencies, banks, providers, online payment processors or IT administrators are commonly used to exploit the unsuspecting public. The notification encourages the recipient to urgently enter/update their personal data. In most cases, the fraudsters try to frighten a recipient with some “urgent” matter (e.g. “We suspect an unauthorized transaction on your account. To ensure that your account is not compromised, please click the link below and confirm your identity”) that requires their immediate attention and the divulging of their personal information. It is often accompanied by a threat to block the account within a limited period if the recipient does not respond. Once information such as user-name and password is entered, it becomes a clear case of identity theft, followed by worse consequences such as transfer of cash from a victim's account, official documents being obtained, or goods being purchased using stolen credentials. Malicious users are also interested in other types of passwords, such as those for social networks, e-mail accounts and other services [Kaspersky, 2014]. In malware-based phishing, malicious software is spread through e-mails or by exploiting security loopholes and installed on the user’s machine. The malware may then capture user inputs, and confidential information may be sent to the ‘phisher’ [Bergholz et al, 2010]. The phishers’ top targets in 2012 were social networks, financial institutions, non-profit organizations and search engines [IBM, 2014] [Kaspersky, 2014].

Phishing attacks use e-mail as their main carrier in order to lure unmindful victims. Phishing can also occur on a fake web site that is a perfect replica of the official site, such as the log-in page for a banking web site, to harvest e-mail addresses and log-on credentials of their victims. The companies spoofed most often were found to be Barclays, Bank of America, PayPal, eBay, etc. [Ludl et al, 2007]. Phishing attacks and identity theft-based scams are becoming more sophisticated in their exploitation of social engineering techniques. While spamming affects bandwidth, social engineering attacks like phishing directly affect their victims. In recent years ‘pharming’ has evolved to be a major concern to e-commerce and bank sites. In ‘pharming’ the attacker redirects unsuspecting users to fake sites or proxy servers with seeded scripts [Abu-nimeh et al, 2007] [Kaspersky, 2014]. The Internet Security Threat Report [Symantec, 2014] states that the rate of phishing increased from 1 in 414 in 2012 to 1 in 392 in 2013. Many of these phishing attempts involve the creation of fake login pages for popular social networking sites. Besides spoofing login pages of legitimate sites, phishers also began launching baits relevant to current events for flavouring the phishing pages.

Several browser extensions (e.g. SafeCache and SafeHistory for Mozilla) and plug-ins (e.g. SpoofGuard) have been proposed [Chou et al, 2004], [Stepp, 2005], [Nattakant, 2009] and [Sta, 2014]. [Chandrasekaran et al, 2006] pointed out several weaknesses of existing browser-based solutions and proposed a novel Support Vector Machine (SVM)-based technique for e-mail spam filtering based on the inherent structural properties of phishing e-mails. [Abu-nimeh et al, 2007] evaluated the predictive accuracy of six popular machine learning-based classifiers on phishing data sets. Phishing countermeasures such as secure e-mail authentication, password hashing, etc. involve high administrative overhead, hence content-based filtering can be used to detect phishing attacks and improve existing solutions. While we agree that client-side solutions for phishing have been developed over the years even by huge software companies, server-side solutions are the focus of research [Abu-nimeh et al, 2007], [Fette et al, 2007] and [Basnet et al, 2008].

Bergholz et al [2010] identified a number of highly informative features of phishing attempts and also proposed a server-side statistical phishing filter. The success of phishing is largely determined by the low levels of user awareness regarding how the fraudsters and spoof sites operate. Increasing user awareness will help users learn to spot the telltale signs of social engineering tricks, which include undue pressure, a false sense of urgency, bogus official letters, too-good-to-be-true offers, quid-pro-quo offers, etc. Meanwhile spam filters remain the first line of defense against phishing. According to the Anti-Phishing Working Group [Anti-Phishing Working Group (APWG), 2014] new brands continue to be targeted by phishers and to battle these
phishing attacks, presently the world needs more phish- employing word stemming and lemmatization are
ing databases. feature space dimension reduction and classifier ac-
curacy.
– Representation: Involves the conversion of an e-
3 Corpus Preprocessing mail message into a specific or structured format
as needed by the machine learning algorithm being
Not all information present in an e-mail is necessary employed.
or useful. Eliminating the less informative and noisy
terms lowers the feature space dimensionality and en- [Androutsopoulos et al, 2000a] studied the effect of
hances classification performance in most cases [Guzella corpus size, lemmatization, and stop-lists while in [An-
and Caminhas, 2009], [Diao et al, 2003] and [Shi et al, droutsopoulos et al, 2000c], they studied the effect of
2012]. Corpus preprocessing is a process that involves word stemming and stop-word removal on the perfor-
transforming the mail corpus into a uniform format that mancce of classifiers. Their results show that often they
is more comprehensible to the machine learning algo- do not contribute to much improvement over the filters
rithms [Zhang et al, 2004], [Katakis et al, 2007]. Due without them. [Chih-Chin Lai and Tsai, 2004] found
to the adversarial nature of spam, spam filters need to that stemming did not introduce any significant im-
constantly adapt to changing spam tactics, particularly provement in the filter’s performance, though it did re-
in feature extraction and feature selection aspects. No duce the feature set size. On the contrary, employing
matter which learning strategy is chosen for the train- stopping produced better performance.
ing and testing of content-based filters, it is crucial to handcraft a private corpus or use a corpus that is publicly available. In any case, e-mails need to undergo preprocessing as a preparation for feature extraction. Furthermore, since a corpus may have an immense number of features, it is very important to choose features judiciously so as to prevent the classifiers from over-fitting [Drucker et al, 1999]. The effectiveness and success of content-based spam filters depend on feature engineering, i.e. defining and creating those features more likely to make the classifier perform better. The primary steps involved in the extraction of features from an e-mail are:

– Lexical Analysis (Tokenization): The string of text representing a message is tokenized in order to identify the candidate words to be adopted as relevant spam or ham terms. Headers, attachments, and HTML tags are stripped, leaving behind just the e-mail body and subject line text. IP addresses and domain names can also be considered as tokens.
– Stop-word Removal: Stop-word removal involves removing frequently used non-informative words, e.g. ‘a’, ‘an’, ‘the’, and ‘is’. Obscure texts or symbols may also be removed in subsequent steps. Stop-word removal makes the selection of candidate terms more efficient and reduces the feature space considerably.
– Stemming: Word-stemming is the process of converting words to their morphological base forms, mainly by eliminating plurals, tenses, gerund forms, prefixes and suffixes. Stemming is closely related to lemmatization, which, while reducing a word, also considers the part of speech and the context of the word. The primary advantages of

3.1 Extracting Features

The easiest feature extraction method is the bag of words (BOW) model (or vector-space model), in which words occurring in the e-mail are treated as features. Given a set of terms T = {t_1, t_2, ..., t_n}, the bag of words model represents a document d as an n-dimensional feature vector x = (x_1, x_2, ..., x_n), where x_i is a function of the occurrence of t_i in d. It is possible to use all the features for classification. However, a feature selection mechanism may be applied to select the best N features by some measure and thus reduce dimensionality. Another simple text representation is the bag of character n-grams. [Kanaris et al, 2006] investigated character n-grams and words in spam filtering to demonstrate the advantage of n-grams over word-tokens. Sparse Binary Polynomial Hashing (SBPH) [Yerazunis, 2003] is another feature generator for e-mails. However, the large number of features it produces made it computationally heavy and of limited use. Siefkes et al [2004] proposed an effective feature combination technique known as Orthogonal Sparse Bigrams (OSB) to extract more compact features. Experiments showed that OSB performed slightly better than SBPH with regard to error rate. More recently, [Zhu and Tan, 2011] proposed a feature extraction approach based on local concentration (LC) which efficiently extracted position-correlated information from e-mail messages. For each style of e-mail analysis, a spam filter developer must decide on a way of performing feature extraction.
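The preprocessing steps and the bag of words representation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any cited work: the stop-word list, the crude suffix-stripping stemmer, and all function names are hypothetical, and a real filter would use a full stop-word list and a proper stemmer (e.g. Porter's).

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "and", "of", "to", "in"}

def tokenize(text):
    """Lexical analysis: strip HTML tags, lowercase, split into word tokens."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.findall(r"[a-z0-9]+", text.lower())

def naive_stem(token):
    """Very crude stemmer: strips one common suffix (hypothetical stand-in)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, remove stop-words, then stem the remaining tokens."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]

def bag_of_words(tokens, vocabulary):
    """Map tokens to a vector x where x_i is the count of term t_i."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

def char_ngrams(text, n=3):
    """Bag of character n-grams: all contiguous substrings of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

tokens = preprocess("<b>Free</b> offers: the winning tickets!")
vector = bag_of_words(tokens, ["free", "offer", "viagra"])
```

Here x_i is a raw term count; a binary indicator or a TF-IDF weight would be equally valid choices for the "function of the occurrence of t_i in d".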
Machine Learning for E-mail Spam Filtering: Review, Techniques and Trends 9
Table 1 A summary of Feature Extraction and Feature Selection techniques in popular literature.

Authors
    Approaches

[Zhang et al, 2004]
    Studied subject line, header, and message body.
    Employed Information Gain (IG), Document Frequency (DF), and χ2 test (CHI) for feature selection.
    Found the bag of words model quite effective for spam filtering, and header features as important as the message body.

[Kanaris et al, 2006]
    Extracted character n-grams of fixed length and variable-length character n-grams.
    Explored Information Gain (IG) as a feature selection technique.
    Character n-grams were noted to be richer and more definitive than word-tokens.

[Delany and Bridge, 2006]
    Considered features of three types (word, character, and structured features) in a feature-based vs feature-free comparison.
    Employed Information Gain (IG) as a feature selection technique.
    Noted feature-free methods to be more accurate than the feature-based system; however, feature-free approaches took much longer than the feature-based approach in classifying e-mails.

[Diao et al, 2003]
    Experimented on features: Header (H), Textual (T), handcrafted features (HH), etc.
    Evaluated different ways of feature selection for Decision Tree and Naive Bayes models.
    Discussed the usefulness and importance of different types of features in detail in experiments.
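Several of the studies summarized in Table 1 rank terms by Information Gain (IG) before classification. The sketch below shows one common way to compute IG for a binary spam/ham task, treating each document as a set of tokens. The function names and the toy data are assumptions made for illustration; they are not taken from any of the cited papers.

```python
import math

def entropy(pos, neg):
    """Binary entropy (in bits) of a set with pos and neg examples."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def information_gain(docs, labels, term):
    """IG of the presence/absence of term.

    docs: list of token sets; labels: list of bools (True = spam).
    """
    n = len(docs)
    spam = sum(labels)
    with_term = [lab for d, lab in zip(docs, labels) if term in d]
    without_term = [lab for d, lab in zip(docs, labels) if term not in d]
    h_after = 0.0
    for subset in (with_term, without_term):
        if subset:
            s = sum(subset)
            h_after += len(subset) / n * entropy(s, len(subset) - s)
    return entropy(spam, n - spam) - h_after

def select_top_terms(docs, labels, k):
    """Keep the k terms with the highest information gain."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return ranked[:k]

docs = [{"free", "money"}, {"free", "offer"},
        {"meeting", "notes"}, {"lunch", "notes"}]
labels = [True, True, False, False]
top = select_top_terms(docs, labels, 2)
```

In this toy corpus ‘free’ and ‘notes’ each separate spam from ham perfectly, so they receive the maximum gain of 1 bit and are the two terms selected.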
3.4.1 Analyzing Temporal Features

3.4.3 Behavior Analysis

The behavioral pattern of an e-mail is ‘what the sender does in composing or distributing e-mails’. Legitimate e-mails have mostly normal and meaningful behavioral patterns, while spam e-mails have abnormal or even conflicting behavior patterns. [Yeh et al, 2005] considered behavior patterns such as data spoofing, time anomaly, relay anomaly, etc., described them by meta-heuristics, and employed them as features for the classification task. To recognize spam and viruses as irregular behavior in e-mail, [Hershkop, 2006] proposed several behavior models, including recipient frequency, group communication, and the user’s past activity histogram. [Ramachandran and Feamster, 2006] studied spammers’ behavior at the network level and found that most spam was received from a small number of regions of IP address space. They suggested that filtering based on network-level characteristics would be much more effective in combating spam, as network-level properties are less malleable than e-mail content. [Li et al, 2007] performed an experimental study of the community behavior of spammers and identified various clustering structures among their population. Based on those structures they proposed some group-based anti-spam strategies exploiting group membership of perceived spam sources. Further work on investigating clustering structures of spammers based on features such as content length, time of arrival, and frequency of e-mail was carried out by [Hao et al, 2009].

curacy is perhaps one of the most uncharted territories in spam filtering research.

Articles classified under ‘Algorithm’ reflect research that essentially focused on classification algorithms and their implementations and evaluations. Articles classified under ‘Architecture’ concentrated on work mainly involved with the development of spam filtering infrastructures. Articles classified under ‘Methods’ refer to studies of the existing filtering methods, while ‘Trends’ covers discourses concentrating on emerging methods and the adaptation of spam filtering methods over time. Limitations listed in the last column, corresponding to each article, are as acknowledged by the authors themselves.

4 Methods for Mitigating E-mail Spam

Although there are ‘social’ methods like legal measures and personal measures (e.g. never respond to spam, never forward chain-letters) to fight spam, they have had a narrow effect on spam so far, as seen by the number of spam messages received daily by users. Technical measures seem to be the most effective in countering spam. Prior to machine learning techniques, many different technical measures were employed for spam filtering, such as rule-based spam filtering, white lists, black lists, challenge-response (C/R) systems, spam filtering, honey pots, OCR filters, and many others, each with its own merits and drawbacks. Black-lists, white-lists, challenge-response (C/R) systems, etc. are origin-based techniques used by reputation-based filters. We discuss briefly some of these popular approaches:
Table 2 A summary of popular machine learning attempts by authors according to perspective (Algorithm, Architecture, Methods, and Trends), with their strengths and limitations.

[Almeida and Yamakami, 2012]
    Technique: MDL principle, SVM
    Perspective: Algorithms, Methods
    Strength: Uses six well-known, large public databases.
    Limitation: Bogofilter and SpamAssassin filters not considered.
began employing content "obfuscation" (or obscuring), by disguising certain terms that are very common in spam messages (e.g., by writing "v!@gra" instead of "viagra", or "F*r*e*e" instead of "Free") in an attempt to prevent the correct identification of these terms by spam filters. Moreover, writing regular expression-based rules is hard and error prone. In spite of these limitations, Symantec Brightmail [Sanz, 2008], a rule-based filter solution, was a success from 2004 till the end of the last decade. It could even track down IP addresses that sent mostly junk mail and performed competitively with SpamBayes, a popular Naive Bayes-based anti-spam solution.

4.2 Blacklisting

A blacklist of e-mail addresses or IP addresses of the servers from which spam is found to originate is created and maintained either at the user or server level. If a user receives an e-mail from any of these addresses, the message is automatically blocked at the SMTP connection phase. This method requires only a simple lookup in the blacklist every time; hence the computational cost is low. Black-lists include Real-time Blackhole Lists (RBL) and Domain Name System Black-lists. Common black-list databases include proxies or open relays, and networks or individual addresses guilty of sending spam. Google blacklists and SpamHaus 7 are examples of blacklists.

7 http://www.spamhaus.org/

Blacklist techniques, though effective, suffer from many drawbacks. A legitimate address may be blacklisted by the filter erroneously or arbitrarily. Innocent users can get victimized and entire domains (e.g. Hotmail) can get blocked when e-mail IDs or IP addresses are used by spammers without the owner’s consent. As spammers resort to the use of new e-mail IDs or IP addresses to bypass blacklists, frequent updates are required to keep the blacklists up-to-date. Lately, the use of botnets by spammers creates an extremely large number of IP addresses to be blacklisted. While the time and effort for updating can be overwhelming, any lag in timeliness leads to vulnerabilities.

4.3 Whitelisting

Whitelisting is the reverse of blacklisting. An e-mail whitelist is a list of pre-approved or trusted contacts, domains, or IP addresses that are able to communicate with a mail user. All e-mails from fresh e-mail addresses are blocked by this method. This restrictive method may introduce an extremely high false positive rate instead of reducing it. Such a method may be good for instant messaging environments but is not a good choice for e-mail, as it prohibits establishing new contacts. Moreover, if spammers somehow got their hands on the whitelist, it would be easy to evade the filter using spoofed addresses, or using well-known whitelisted mailing lists. This method requires a lot of maintenance but provides only a moderate filtering rate. It can be employed together with other anti-spam techniques [Michelakis et al, 2004].

4.5 Challenge Response (CR) systems

While white-lists place the burden of determining the authenticity of contacts on the receiver, Challenge-Response (CR) systems transfer the burden of authentication back to the sender. After sending an e-mail to the receiver, the sender receives a challenge from the receiving Mail Transfer Agent (MTA). The challenge may range from a simple question to a CAPTCHA ("Completely Automated Public Turing test to tell Computers and Humans Apart"). The sender is obliged to reply correctly in his response; otherwise his message will be deleted or put into the spam folder. While this method is effective in catching spam from automated systems or botnets, it introduces an undesirable delay in the delivery process. CR systems are controversial solutions and are often criticized for the inconvenience caused by this overhead in communication. Besides, legitimate e-mails from automated mailing lists may also be blocked since these will fail the challenge. In addition, CR systems are also believed to be the cause behind the backscatter e-mail phenomenon. [Isacenkova and Balzarotti, 2011] developed a real world deployment of a CR based anti-spam system and evaluated its effectiveness and impact on end-users.

4.6 Collaborative Spam filtering
and ‘refinance’ will have high spam probability values, while names of friends and siblings will have low spam probability values. These a priori probabilities are combined with the observed data set, which is a sizeable collection of e-mails that have already been categorized as ‘spam’ and ‘ham’, to determine the final probability that an e-mail message is either spam or legitimate. Even with the flawed assumption of presumed decorrelation, Bayesian classifiers work extremely well and are surprisingly effective [Sahami et al, 1998], [Pantel and Lin, 1998], [Graham, 2003] and [Yerazunis, 2004].

The Naive Bayes method has become extremely popular due to the high levels of accuracy that it can potentially provide, and it often serves as a baseline classifier for comparison with other filtering approaches. Bayesian filters are the most employed filters for classifying spam nowadays [Guzella and Caminhas, 2009], [Metsis et al, 2006] and can operate either at the network mail server level or in client e-mail programs.

One limitation of standard Bayesian filters is that they ignore the correlation among inputs or events; i.e. such filters do not consider that the words ‘special’ and ‘offers’ are more likely to appear together in spam e-mail than in legitimate e-mail [Carpinter and Hunt, 2006]. But text analysis confirms that words have a very significant correlation and are not chosen randomly. In spite of this over-simplistic assumption, Bayesian classifiers have been found to work remarkably well [Androutsopoulos et al, 2006], [Almeida et al, 2010]. However, to address this limitation, Yerazunis [2003] and Siefkes et al [2004] introduced sparse binary polynomial hashing (SBPH) and orthogonal sparse bigrams (OSB).

SBPH is a generalization of Bayesian filtering that can match mutating phrases as well as individual words or tokens, and uses the Bayesian Chain Rule (BCR) to combine the individual feature conditional probabilities into an overall probability. SBPH had a more expressive feature space and delivered >99.9% accuracy on real-time e-mail without white-lists or blacklists from as little as 500K of pre-categorized text. However, SBPH was computationally expensive; OSB retains the expressivity of SBPH but avoids most of the cost. A filter based on OSB, along with the non-probabilistic Winnow algorithm as a replacement for the Bayesian Chain Rule, outperformed SBPH by 0.04% error rate; moreover, OSB used just 600,000 features, while SBPH used 1,600,000 features to reach its best results. Yerazunis [2004] argued that most Bayesian filters seem to reach a plateau of accuracy at 99.9 percent, so enhancements were necessary. They set up an SBPH/BCR classifier and compared three different training methods: TEFT (Train Every Thing), TOE (Train Only Errors), and TUNE (Train Until No Errors), and found TOE training to be acceptable in performance and accuracy. Different extensions to Bayesian filtering, such as Token Grab Bag, Token Sequence Sensitive, Sparse Binary Polynomial Hashing with Bayesian Chain Rule (SBPH/BCR), Peaking Sparse Binary Polynomial Hashing, and Markovian matching, were also tested. Markovian matching produced the best performance of all the filters.

According to Ludlow [2002], the vast majority of the tens of millions of spam e-mails might be the handiwork of only 150 spammers around the world. Again, authors have ‘textual fingerprints’, at least for texts produced by writers who are not consciously changing their
16 Alexy Bhowmick, Shyamanta M. Hazarika
style of writing across texts, as argued by Baayen et al [2002]. Therefore authorship identification techniques can be used to identify the ‘textual fingerprints’ of this small group and eliminate a significant proportion of spam. Brien and Vogel [2003] were the first to apply authorship identification techniques, in the form of the ‘Chi by degrees of freedom’ method, to the area of e-mail spam filtering. The authors examined the Naive Bayesian method in relation to this authorship identification technique. They found that the Bayesian method was more effective when characters were used as tokens than when words were used as tokens. The ‘Chi by degrees of freedom’ method, when used with characters as tokens, had a lower error rate than the Bayes method. They concluded that the tokens chosen affected the precision and recall parameters. Taking a leaf out of text classification, Song et al [2009] proposed a correlation-based document term weighting method to address the problem of low-FPR classification in the context of Naive Bayes.

[Lai and Tsai, 2004] conducted systematic experiments on e-mail categorization involving Naive Bayes (NB), Term Frequency - Inverse Document Frequency (TF-IDF), k-Nearest Neighbor (k-NN), and Support Vector Machines (SVMs). NB, TF-IDF and SVM achieved satisfactory results while k-NN had the worst performance of all. It was seen that stemming did not affect performance; however, employing a stopping procedure yielded better performance. They concluded that combining the different techniques seemed a very promising prospect. [Lai, 2007] made a similar comparative study of three commonly used Machine Learning algorithms: NB, k-NN and SVMs. From experimental results, NB and SVM were found to perform better than k-NN. [Youn and Mcleod, 2006] and [Yu and Xu, 2008] reported similar experiments with four machine learning algorithms each. [Seewald, 2007] investigated the simple Naive Bayes learner represented by SpamBayes, and two variants of Naive Bayes learning, SA-Train and CRM-114. SA-Train incorporated background knowledge made up of rules while CRM-114 considered multi-word phrases and their probability estimates. It was seen that all three systems performed equally well, and that the addition of background knowledge to SA-Train and the extended description language considering multi-word phrases in the case of CRM-114 failed to improve Bayesian learning significantly. SpamBayes offered the most stable performance and deteriorated least over time. [Almeida et al, 2010] reported that probabilistic approaches like Bayesian classification suffer from the ‘curse of dimensionality’. They verified how dimensionality reduction influences the accuracy of Naive Bayesian spam filters.

5.2 Support Vector Machines (SVMs)

Support vector machines (SVMs) are ranked among the best ‘off-the-shelf’ supervised learning algorithms. SVMs have become one of the most sought-after classifiers in the Machine Learning community because they provide superior generalization performance, require fewer examples for training, and can tackle high-dimensional data with the help of kernels [Rios and Zha, 2004] [Wu et al, 2007]. SVMs work by mapping the feature vectors (training data) into a linear or non-linear feature space through a kernel function. In that feature space, an optimal separating hyper-plane (OSH) is constructed which splits the positive samples and the negative samples with maximum margin. The hyper-plane is then employed as a (possibly non-linear) decision boundary for use on real-world data.

[Drucker et al, 1999] used SVMs for content-based classification and compared their performance with other classifiers: Ripper, Rocchio, and boosting of C4.5 decision trees. It was found that boosting trees and SVMs attained good performance with regard to speed and accuracy during testing. SVMs with binary features produced the best results, required less training, and their performance did not degrade when too many features were used. [Woitaszek and Shaaban, 2003] utilized an SVM-based filter for Microsoft Outlook to identify commercial e-mail. Classification models for spam and ham messages were built by the SVM using personal and impersonal dictionaries. Both yielded identical results, attaining a best accuracy of 96.69%. [Rios and Zha, 2004] experimented with SVMs and Random Forests (RFs) and compared them against Naive Bayes models. They concluded that SVM and RF classifiers were equivalent, and that the RF classifier had greater robustness at low false positive (FP) rates; they both outperformed Naive Bayes models at low FP rates. [Tseng and Chen, 2009] proposed a complete spam detection system, MailNET, which is an incremental SVM model on dynamic e-mail social networks. Although SVMs provide high accuracy for spam filtering, they have generally been associated with high computational cost and some expensive false positive errors; hence, a few solutions were offered, e.g. Online SVMs [Sculley and Wachman, 2007], Ensemble of SVMs [Blanco et al, 2007], etc. A detailed study of various distance-based kernels and spam filtering behaviors employing SVM is found in [Amayri and Bouguila, 2010].

5.3 Clustering Techniques

Clustering is the task of grouping a set of patterns into similar groups. Clustering techniques have been widely
studied and used in a variety of application domains. Spam filtering datasets often have true labels available, and clustering algorithms, being unsupervised learning tools, do not always produce groupings closely related to the true labelings. However, given suitable representations, most clustering algorithms can partition e-mail spam datasets into ham and spam clusters. This was demonstrated by [Whissell and Clarke, 2011] in a novel investigation of e-mail spam clustering. The results were surprisingly significant, as their clustering-based approach bettered those of previously published state-of-the-art semi-supervised approaches, hence showing that clustering can be a powerful tool for e-mail spam filtering.

Prior to this, [Sasaki and Shinnou, 2005] had proposed a spam detection technique making use of text clustering through a vector space model. [Basavaraju and Prabhakar, 2010] presented an effective clustering algorithm integrating K-means and BIRCH algorithm features. The K-means algorithm worked well for small scale data sets. BIRCH with K-Nearest Neighbour Classifier (K-NNC) was found to be the ideal combination as it performed better with large data sets. [Debarr and Wechsler, 2009] relied on a term frequency and inverse document frequency representation for e-mails and employed the Partitioning Around Medoids (PAM) clustering algorithm to cluster a uniform sample of 25% of messages in the training pool. Clustering combined with Random Forests for classification and active learning for refinement produced the best Area Under Curve (AUC) of 95.2%. These works conclude that employing the ham/spam clusters is an effective method for spam detection, and because a ham/spam split is a natural clustering for an e-mail spam dataset, clustering techniques should be investigated further as a tool for more robust content-based spam filters.

5.4 Ensemble Classifiers

Ensemble learning is a novel technique where a set of individual classifiers are trained and brought together to enhance the classification accuracy of the overall system on the same problem (spam detection). An ensemble of classifiers is very effective for classification tasks and offers good generalization. Spam filters have to deal with a diversity of spam, so they need to continually evolve in order to detect new types of spam (future spam), and at the same time not allow ‘classical’ spam to evade the filter. Therefore, [Guerra et al, 2010] suggested that combining old and new filters (e.g. using ensemble classifiers) may be an interesting strategy to deal with the diversity of spam. The most popular ensemble classifiers are bagging and boosting.

Bagging (or bootstrap aggregating) is an ensemble meta-learning algorithm that is usually applied to decision tree methods; e.g. the Random Forest algorithm is an ensemble technique for decision trees that is known to achieve very high classification accuracy. [Biggio et al, 2011] employed bagging ensembles to counter poisoning attacks on spam filters. Random forests have also been used in the spam detection models described in [Debarr and Wechsler, 2009] and [Lee et al, 2010b]. Boosting [Biggio et al, 2011] involves algorithms that build a single strong learner from a set of weak learners. AdaBoost is the most common implementation of Boosting. Boosting for filtering of spam messages was first reported by [Carreras and Marquez, 2001]. [Androutsopoulos et al, 2006] compared the four most promising learning algorithms from earlier work: LogitBoost, Naive Bayes, Flexible Bayes and linear SVM. The authors studied the role of attributes characterizing n-gram frequencies and explored the effect of attribute size and training set in a cost-sensitive framework. Using evaluation measures as in [Androutsopoulos et al, 2000b], and the PU1 corpus in experiments, [Carreras and Marquez, 2001] demonstrated the definite effect of boosting in decision-tree filters. Methods based on boosting outperformed Naive Bayes and Decision Tree algorithms when tested on the PU1 corpus. [Sakkis et al, 2001] experimented with combining a memory-based classifier and a Naive Bayes filter, with another memory-based classifier as president, in a stacking framework. They achieved impressive precision and recall and concluded that stacking consistently raises the performance of the overall filter. [He and Thiesson, 2007] proposed a new asymmetric boosting method, Boosting with Different Costs, and applied it to spam filtering. [Neumayer, 2006], [Shi et al, 2012], and [Blanco et al, 2007] also discuss the application of ensemble learning to spam filtering.

6 Evaluation Measures and Benchmarks

Ideally spam filters should be evaluated on large, publicly available spam and ham databases. Sometimes Accuracy (Acc), the ratio of messages correctly classified, is used as an integrated measure of performance. If $N_L$ and $N_S$ signify the number of legitimate messages and spam messages to be classified, then we define Accuracy (Acc) and Error (Err) of the spam filter as:

$$\mathrm{Acc} = \frac{|L \to L| + |S \to S|}{N_L + N_S} \quad \text{and} \quad \mathrm{Err} = 1 - \mathrm{Acc} = \frac{|L \to S| + |S \to L|}{N_L + N_S}$$

Accuracy and Error consider both False Positive |L → S| and False Negative |S → L| events to carry equal cost. However, spam filtering involves asymmetric
error costs. Failing to identify a ham, i.e. misclassifying a ham as spam (a False Positive event), is generally a costlier mistake than missing a spam (a False Negative event). For example, a business letter from the boss or a personal message from a spouse that is quarantined (and delayed) or deleted can lead to serious consequences, while seeing a spam in our inbox may cause just a slight irritation. A True Negative event |L → L| is when a ham e-mail is correctly classified as ham. A True Positive event |S → S| is when a spam e-mail is correctly classified as spam. With this in mind, the False Positive Rate (FPR), the proportion of legitimate e-mails identified as spam, is represented as:

$$\mathrm{FPR} = \frac{\#\,\text{of False Positives}}{\#\,\text{of False Positives} + \#\,\text{of True Negatives}} = \frac{|L \to S|}{|L \to S| + |L \to L|}$$

Again, failing to identify spam, e.g. e-mails containing viruses, worms, or phishing baits as payload, can incur significant risks to the user. The False Negative Rate (FNR), i.e. the proportion of spam messages that were classified as legitimate, is another suitable measure.

$$\mathrm{FNR} = \frac{\#\,\text{of False Negatives}}{\#\,\text{of True Positives} + \#\,\text{of False Negatives}} = \frac{|S \to L|}{|S \to S| + |S \to L|}$$

Superior spam classifiers have lower FPR and FNR. The two-dimensional quantity (FNR, FPR) denotes the effectiveness of hard classifiers, while the effectiveness of soft classifiers may be denoted by a set of such pairs defining a curve, the ROC (Receiver Operating Characteristics) curve. ROC analysis is an excellent performance tool in spam filtering. A spam filter whose ROC curve lies strictly above that of another is the better filter in all deployment scenarios [Cormack, 2008].

Two measures borrowed from Information Retrieval, ‘Recall’ and ‘Precision’, are often used for capturing the effectiveness and quality of spam filters respectively [Androutsopoulos et al, 2000a]. If |S → L| signifies the number of spam messages classified as legitimate, and |S → S| signifies the number of spam messages correctly classified as spam, and likewise for |L → L| and |L → S|, then Spam Recall ($R_s$) and Spam Precision ($P_s$) are defined by the equations:

$$R_s = \frac{|S \to S|}{|S \to S| + |S \to L|} \quad \text{and} \quad P_s = \frac{|S \to S|}{|S \to S| + |L \to S|}$$

False positives are considerably more expensive (λ times) when compared with false negatives [Androutsopoulos et al, 2000a] [Androutsopoulos et al, 2006]. Here, λ is a parameter that specifies how ‘dangerous’ or ‘costly’ it is to misclassify legitimate e-mail as spam and reflects the extra effort it requires from the user to recover from failures of the filter. For many users false positives are unacceptable. [Androutsopoulos et al, 2006] suggested this cost sensitivity be taken into account by treating each legitimate message as equal to λ messages. The cost-sensitive measures Weighted Accuracy (WAcc), Weighted Error Rate (WErr) and Total Cost Ratio (TCR) [Clark, 2008] are defined as follows. Because each legitimate message counts λ times, the denominator of WAcc and WErr is $\lambda N_L + N_S$, which is what makes WErr = 1 - WAcc hold.

$$\mathrm{WAcc} = \frac{\lambda |L \to L| + |S \to S|}{\lambda N_L + N_S} \quad \text{and} \quad \mathrm{WErr} = 1 - \mathrm{WAcc} = \frac{\lambda |L \to S| + |S \to L|}{\lambda N_L + N_S}$$

$$\mathrm{TCR} = \frac{N_S}{\lambda |L \to S| + |S \to L|}$$

The Total Cost Ratio is used to compare the effectiveness of a filter for a given λ with a baseline setting [Guzella and Caminhas, 2009]. It is evidence of the improvement brought about by the filter. This cost-sensitive evaluation uses the λ parameter to adjust the weight of a false positive. Three values of λ are commonly used in spam literature: λ = 1, 9, 999 [Androutsopoulos et al, 2000b], [Androutsopoulos et al, 2000a], [Androutsopoulos et al, 2006], [Sakkis et al, 2001] and [Clark, 2008]. These values represent the situations when a false positive costs the same as a false negative, when a false positive is 9 times costlier than a false negative, and when it is 999 times costlier. Greater TCR values indicate superior performance. The F-measure or F-score is another measure that combines both Precision ($P_s$) and Recall ($R_s$) in one equation. It can be interpreted as the weighted harmonic mean of both.

$$F\text{-measure} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
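As a worked illustration of the measures in this section, the sketch below computes them from the four confusion counts. The function name `spam_metrics` and the example counts are assumptions made for illustration. The weighted measures use the denominator λ·N_L + N_S of [Androutsopoulos et al, 2000a], so that WErr = 1 - WAcc holds.

```python
def spam_metrics(n_ll, n_ls, n_sl, n_ss, lam=1.0):
    """Evaluation measures for a spam filter.

    n_xy is the number of messages of true class x classified as y,
    with L = legitimate (ham) and S = spam. Assumes both classes are
    non-empty and at least one message is classified as spam.
    """
    n_l = n_ll + n_ls            # N_L: legitimate messages
    n_s = n_sl + n_ss            # N_S: spam messages
    recall = n_ss / (n_ss + n_sl)
    precision = n_ss / (n_ss + n_ls)
    cost = lam * n_ls + n_sl     # weighted number of errors
    wacc = (lam * n_ll + n_ss) / (lam * n_l + n_s)
    return {
        "acc": (n_ll + n_ss) / (n_l + n_s),
        "fpr": n_ls / n_l,       # ham classified as spam
        "fnr": n_sl / n_s,       # spam classified as ham
        "recall": recall,
        "precision": precision,
        "wacc": wacc,
        "werr": 1.0 - wacc,
        "tcr": n_s / cost if cost else float("inf"),
        "f_measure": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts: 100 ham (10 misclassified), 100 spam (20 missed).
m = spam_metrics(n_ll=90, n_ls=10, n_sl=20, n_ss=80, lam=9)
```

With these hypothetical counts and λ = 9, plain accuracy is 0.85 while weighted accuracy is 0.89, and the TCR falls below 1 (100/110). In the cost-sensitive framework a TCR below 1 means that not filtering at all would have been cheaper.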
Corpus Name; Number of Messages/Images (Spam, Ham); Spam Rate; Year of Creation; Reference/Used
making the rebuilding of the model imperative (called virtual concept drift). Spam filtering is a dynamic problem that involves concept drift. While the understanding of an unwanted message may remain the same, the statistical properties of spam e-mail change over time, since they are driven by spammers involved in a never-ending arms race with spam filters. Another reason for concept drift could be the different products or scams driven by spam that tend to become popular. The dynamic nature of spam is one of its most testing aspects. An effective spam filter must be able to track target concept drift and swiftly adapt to it. Research on concept drift confirms lazy learning techniques to be the most effective models against concept drift [Tsymbal, 2004], [Tsymbal et al, 2008]. Most of the earlier evaluations did not try to deal with concept drift, or with real-world datasets that have some concept drift. A few authors tried to address concept drift in spam filtering using Case-Based Reasoning [Delany et al, 2005], Instance-Based Reasoning [Fdez-Riverola et al, 2007b], Ensemble Learning [Tsymbal et al, 2008], and a Language Model technique [Hayat et al, 2010]. A particular challenge in handling concept drift is distinguishing between true concept drift and noise. Research in concept drift is a very active area in spam filtering.

spam analysis where much work needs to be done on leveraging existing algorithms.

7.3 Emerging Spam Threats

One of the biggest spam problems today, even as spam e-mail volumes associated with botnets are receding, is snowshoe spam. Snowshoe spamming is a technique that uses multiple IP addresses, websites and sub-networks to send spam, so as to avoid detection by spam filters. The term ‘snowshoe’ spam describes how some spammers distribute their load across a larger surface to keep from sinking, just as snowshoe wearers do [McAfee, 2012] [Sophos, 2013]. Social networks have also become a hunting ground for spammers. With many users migrating to social networks as a means of communication, spammers are diversifying in order to stay in business. The personal information revealed in social networks is gleaned by spammers to target unsuspecting victims with tailored e-mails.

7.4 Prioritising E-mails
Cranor LF, Lamacchia BA (1998) Spam! Communications of the ACM 41(8)
Crocker D (2009) Internet Mail Architecture - RFC 5598. Tech. rep., URL https://tools.ietf.org/html/rfc5598
Cyberoam (2014) Internet Threats Trend Report 2014. Tech. Rep. April, Cyberoam
Debarr D, Wechsler H (2009) Spam Detection using Clustering, Random Forests, and Active Learning. In: CEAS 2009 Sixth Conference on Email and Anti-Spam
Delany SJ, Bridge D (2006) Feature based and Feature free Textual CBR: a Comparison in Spam Filtering. In: Proceedings of the 17th Irish Conference on Artificial Intelligence and Cognitive Science (AICS ’06), pp 244–253
Delany SJ, Cunningham P, Tsymbal A, Coyle L (2005) A Case-based Technique for Tracking Concept Drift in Spam Filtering. Knowledge-Based Systems 18(4-5):187–195, DOI 10.1016/j.knosys.2004.10.002, URL http://linkinghub.elsevier.com/retrieve/pii/S0950705105000316
Diao Y, Lu H, Wu D (2003) A Comparative Study of Classification Based Personal E-mail Filtering. In: Knowledge Discovery and Data Mining. Current Issues and New Applications, pp 408–419
Dredze M, Schilit BN, Norvig P (2009) Suggesting Email View Filters for Triage and Search. In: Proceedings of International Joint Conference on Artificial Intelligence (IJCAI), pp 1414–1419
Drucker H, Wu D, Vapnik VN (1999) Support Vector Machines for Spam Categorization. IEEE Transactions on Neural Networks 10(5):1048–1054
Fawcett T (2004) “In vivo” Spam Filtering: A Challenge Problem for Data Mining. SIGKDD Explorations 5(2):140–148, URL http://arxiv.org/abs/cs/0405007, 0405007
Fdez-Riverola F, Iglesias E, Díaz F, Méndez J, Corchado J (2007a) Applying Lazy Learning Algorithms to Tackle Concept Drift in Spam Filtering. Expert Systems with Applications 33(1):36–48, DOI 10.1016/j.eswa.2006.04.011, URL http://linkinghub.elsevier.com/retrieve/pii/S0957417406001175
Fdez-Riverola F, Iglesias E, Díaz F, Méndez J, Corchado J (2007b) SpamHunting: An Instance-based Reasoning System for Spam Labelling and Filtering. Decision Support Systems 43(3):722–736, DOI 10.1016/j.dss.2006.11.012, URL http://linkinghub.elsevier.com/retrieve/pii/S0167923606002041
Fette I, Sadeh N, Tomasic A (2007) Learning to Detect Phishing Emails. In: Proceedings of the 16th International Conference on World Wide Web, New York, NY, USA, pp 649–656
Fumera G (2006) Spam Filtering Based On The Analysis Of Text Information Embedded Into Images. Journal of Machine Learning Research (special issue on Machine Learning in Computer Security) 7:2699–2720
Gansterer WN, Ecker GF (2008) On the Relationship Between Feature Selection and Classification Accuracy. Journal of Machine Learning Research 4:90–105
Garriss S, Kaminsky M, Freedman MJ, Karp B, Mazières D, Yu H (2006) RE: Reliable Email. In: NSDI’06 Proceedings of the 3rd Conference on Networked Systems Design & Implementation, pp 22–22
Golbeck J, Hendler J (2004) Reputation Network Analysis for Email Filtering Creating the Reputation Network. In: Proceedings of the First Conference on Email and Anti-Spam, Mountain View, California
Gomez JC, Boiy E, Moens MF (2012) Highly Discriminative Statistical Features for Email Classification. Knowledge and Information Systems 31(1):23–53, DOI 10.1007/s10115-011-0403-7, URL http://link.springer.com/10.1007/s10115-011-0403-7
Goodman BJ, Cormack GV, Heckerman D (2007) Spam and the Ongoing Battle for the Inbox. Communications of the ACM 50(2):24–33
Graham P (2002a) A Plan for Spam. URL http://www.paulgraham.com/spam.html
Graham P (2002b) Will Filters Kill Spam? URL http://www.paulgraham.com/wfks.html
Graham P (2003) Better Bayesian Filtering. URL http://www.paulgraham.com/better.html
Graham-Cumming J (2004) How to Beat an Adaptive Spam Filter. In: The Spam Conference
Graham-Cumming J (2006) Does Bayesian Poisoning Exist? Virus Bulletin, URL https://www.virusbtn.com/spambulletin/archive/2006/02/sb200602-poison.dkb?url=/archive/2006/02/sb200602-poison
Greenberg A (2010) The Most Common Words In Spam Email. URL http://www.forbes.com/sites/firewall/2010/03/17/the-most-common-words-in-spam-email/
Guerra PHC, Guedes D, Jr WM, Hoepers C, Chaves MHPC, Steding-jessen K (2010) Exploring the Spam Arms Race to Characterize Spam Evolution. In: CEAS 2010 - Seventh Collaboration, Electronic messaging, Anti-Abuse and Spam Conference, Redmond, Washington USA
Guyon I (2003) An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3:1157–1182
Guzella TS, Caminhas WM (2009) A Review of Machine Learning Approaches to Spam Filtering. Expert Systems with Applications 36(7):10,206–10,222, DOI 10.1016/j.eswa.2009.02.037, URL http://linkinghub.elsevier.com/retrieve/pii/S095741740900181X
Gyongyi Z, Garcia-molina H (2005) Web Spam Taxonomy. In: 1st International Workshop on Adversarial Information Retrieval on the Web
Hao S, Syed NA, Feamster N, Gray AG, Krasser S (2009) Detecting Spammers with SNARE: Spatio-temporal Network-level Automatic Reputation Engine. In: Proceedings of 18th USENIX Security
Hayat MZ, Basiri J, Seyedhossein L, Shakery A (2010) Content-Based Concept Drift Detection for Email Spam Filtering. In: 5th International Symposium on Telecommunications (IST’2010), pp 531–536
He J, Thiesson B (2007) Asymmetric Gradient Boosting with Application to Spam Filtering. In: Fourth Conference on Email and Anti-Spam (CEAS), Mountain View, California, USA
Hershkop S (2006) Behavior-based Email Analysis with Application to Spam Detection. PhD thesis
Hershkop S, Stolfo SJ (2005) Identifying Spam Without Peeking at the Contents. ACM Crossroads, p 11
Hu Y, Guo C, Ngai E, Liu M, Chen S (2010) A Scalable Intelligent Non-content-based Spam-filtering Framework. Expert Systems with Applications 37(12):8557–8565, DOI 10.1016/j.eswa.2010.05.020, URL http://linkinghub.elsevier.com/retrieve/pii/S0957417410004318
IBM (2012) IBM X-Force 2012 Mid-Year Trend and Risk Report. Tech. rep., IBM, URL http://www.ibm.com/smarterplanet/global/files/ca__en_us__security__xorce_2012_midyear_trend_and_risk_report.pdf
IBM (2014) IBM X-Force Threat Intelligence Quarterly, 4Q 2014. Tech. Rep. November, IBM, URL http://public.dhe.ibm.com/common/ssi/ecm/wg/en/wgl03062usen/WGL03062USEN.PDF
Irwin B, Friedman B (2008) Spam Construction Trends. In: Information Security for South Africa (ISSA), pp 1–12
Isacenkova J, Balzarotti D (2011) Measurement and Evaluation of a Real World Deployment of a Challenge-Response Spam Filter. In: Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement Conference (IMC ’11), pp 413–426
John JP, Moshchuk A, Gribble SD, Krishnamurthy A (2009) Studying Spamming Botnets Using Botlab. In: USENIX Symposium on Networked Systems Design and Implementation (NSDI)
Jonathan B Postel (1982) Simple Mail Transfer Protocol - RFC 821. Tech. rep., URL https://www.ietf.org/rfc/rfc0821.txt
Jorgensen Z, Zhou Y, Inge M (2008) A Multiple Instance Learning Strategy for Combating Good Word Attacks on Spam Filters. Journal of Machine Learning Research 8:1115–1146
Kanaris I, Kanaris K, Houvardas I, Stamatatos E (2006) Words vs. Character n-grams for Anti-spam Filtering. International Journal on Artificial Intelligence Tools XX(X):1–20
Kaspersky (2014) Kaspersky Security Bulletin 2014. Predictions 2015. Tech. rep.
Katakis I, Tsoumakas G, Vlahavas I (2007) Email Mining: Emerging Techniques for Email Management. In: Vakali A, Pallis G (eds) Web Data Management Practices: Emerging Techniques and Technologies, Idea Group Publishing, USA, chap 10
Kiritchenko S, Matwin S, Abu-hakima S (2004) Email Classification with Temporal Features. In: Intelligent Information Processing and Web Mining, Springer Berlin Heidelberg, pp 523–533
Kolari P, Java A, Finin T, Oates T, Joshi A (2006) Detecting Spam Blogs: A Machine Learning Approach. In: AAAI’06 Proceedings of the 21st National Conference on Artificial Intelligence, pp 1351–1356
Koprinska I, Poon J, Clark J, Chan J (2007) Learning to Classify E-mail. Information Sciences 177(10):2167–2187, DOI 10.1016/j.ins.2006.12.005, URL http://linkinghub.elsevier.com/retrieve/pii/S0020025506003707
Lai CC (2007) An Empirical Study of Three Machine Learning Methods for Spam Filtering. Knowledge-Based Systems 20(3):249–254, DOI 10.1016/j.knosys.2006.05.016, URL http://linkinghub.elsevier.com/retrieve/pii/S0950705106001390
Lee K, Caverlee J, Webb S (2010a) Uncovering Social Spammers: Social Honeypots + Machine Learning. In: Proc. of 33rd Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, New York, NY, USA, pp 435–442
Lee SM, Kim DS, Kim JH, Park JS (2010b) Spam Detection Using Feature Selection and Parameters Optimization. 2010 International Conference on Complex, Intelligent and Software Intensive Systems (i):883–888, DOI 10.1109/CISIS.2010.116, URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5447486
Leiba B, Ossher J, Segal R, Wegman M (2005) SMTP Path Analysis. In: Proceedings of Second Conference on Email and Anti-Spam, CEAS ’2005
Li F, Hsieh Mh, Gburzynski P (2007) The Community Behavior of Spammers. URL http://web.media.mit.edu/~fulu/ClusteringSpammers.pdf
Liu C, Stamm S (2007) Fighting Unicode-Obfuscated Spam
for Incremental Spam Filtering. In: European Conference on Machine Learning (ECML), pp 410–421
Siponen M, Stucke C (2006) Effective Anti-spam Strategies in Companies: An International Study. In: Proceedings of the 39th Hawaii International Conference on System Sciences (HICSS), vol 06, pp 1–10
Song Y, Kocz A, Giles CL (2009) Better Naive Bayes Classification for High-precision Spam Detection. Software Practice and Experience (April):1003–1024, DOI 10.1002/spe
Sophos (2013) Security Threat Report 2013. Tech. rep., Sophos
Sophos (2014) Security Threat Report - 2014. Tech. rep., Sophos, URL http://www.sophos.com/en-us/threat-center/medialibrary/PDFs/other/sophos-security-threat-report-2014.pdf
Soranamageswari M, Meena C (2010) Statistical Feature Extraction for Classification of Image Spam Using Artificial Neural Networks. In: 2010 Second International Conference on Machine Learning and Computing, IEEE, pp 101–105, DOI 10.1109/ICMLC.2010.72, URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5460761
Stepp M (2005) PhishHook: A Tool to Detect and Prevent Phishing Attacks. In: DIMACS Workshop on Theft in E-Commerce: Content, Identity, and Service
Stern H, Mason J, Shepherd M (2004) A Linguistics-based Attack on Personalised Statistical E-mail Classifiers. Tech. rep., Dalhousie Univ, URL https://www.cs.dal.ca/sites/default/files/technical_reports/CS-2004-06.pdf
Symantec (2014) Internet Security Threat Report. Tech. Rep. April, Symantec
Taylor B, Fingal D, Aberdeen D (2007) The War Against Spam: A report from the Front Line. In: Workshop on Machine Learning in Adversarial Environments for Computer Security (NIPS 2007), pp 1–3
Toolan F, Carthy J (2010) Feature Selection for Spam and Phishing Detection. In: eCrime Researchers Summit (eCrime), 2010, pp 1–12
Tretyakov K (2004) Machine Learning Techniques in Spam Filtering. In: Data Mining Problem-oriented Seminar, MTAT, May, pp 60–79
Tseng CY, Chen MS (2009) Incremental SVM Model for Spam Detection on Dynamic Email Social Networks. In: 2009 International Conference on Computational Science and Engineering, IEEE, pp 128–135, DOI 10.1109/CSE.2009.260, URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5284281
Tsymbal A (2004) The Problem of Concept Drift: Definitions and Related Work (Technical Report TCD-CS-2004-15). Tech. rep., Computer Science Department, Trinity College, Dublin, Ireland
Tsymbal A, Pechenizkiy M, Cunningham P (2008) Dynamic Integration of Classifiers for Handling Concept Drift. Information Fusion 9(1):56–68
Wang CC, Chen SY (2007) Using Header Session Messages to Anti-spamming. Computers & Security 26(5):381–390, DOI 10.1016/j.cose.2006.12.012, URL http://linkinghub.elsevier.com/retrieve/pii/S0167404807000065
Wang D, Irani D, Pu C (2013) A Study on Evolution of Email Spam Over Fifteen Years. In: Proceedings of the 9th IEEE International Conference on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom), ICST, Austin, TX, USA, DOI 10.4108/icst.collaboratecom.2013.254082, URL http://eudl.eu/doi/10.4108/icst.collaboratecom.2013.254082
Wang S, Wang B, Lang H, Cheng X (2005) Using Non-Textual Information to Improve Spam Filtering Performance. In: CAS-ICT at Text REtrieval Conference (TREC) 2005 SPAM Track
Whissell JS, Clarke CLA (2011) Clustering for Semi-Supervised Spam Filtering. In: Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference (CEAS ’11), pp 125–134
Wittel GL, Wu SF (2004) On Attacking Statistical Spam Filters. In: CEAS: First Conference on Email and Anti-Spam, Mountain View, CA
Woitaszek M, Shaaban M (2003) Identifying Junk Electronic Mail in Microsoft Outlook with a Support Vector Machine. In: Proceedings of the 2003 symposium on applications and the internet, SAINT, pp 166–169
Wu Ch (2009) Behavior-based Spam Detection using a Hybrid Method of Rule-based Techniques and Neural Networks. Expert Systems With Applications 36(3):4321–4330, DOI 10.1016/j.eswa.2008.03.002, URL http://dx.doi.org/10.1016/j.eswa.2008.03.002
Wu CH, Tsai CH (2008) Robust Classification for Spam Filtering by Back-propagation Neural Networks using Behavior-based Features. Applied Intelligence 31(2):107–121, DOI 10.1007/s10489-008-0116-0, URL http://link.springer.com/10.1007/s10489-008-0116-0
Wu X, Kumar V, Ross Quinlan J, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Yu PS, Zhou ZH, Steinbach M, Hand DJ, Steinberg D (2007) Top 10 Algorithms in Data Mining, vol 14. DOI 10.1007/s10115-007-0114-2, URL http://link.springer.com/10.1007/s10115-007-0114-2
Xie Y, Yu F, Achan K, Panigrahy R, Hulten G, Osipkov I, Communication CC, Network N (2008) Spamming Botnets: Signatures and Characteristics. In: Proceedings of ACM SIGCOMM08, Seattle, WA
Yang J, Liu Y, Liu Z, Zhu X, Zhang X (2011) A New Feature Selection Algorithm based on Binomial Hypothesis Testing for Spam Filtering. Knowledge-Based Systems 24(6):904–914, DOI 10.1016/j.knosys.2011.04.006, URL http://linkinghub.elsevier.com/retrieve/pii/S0950705111000724
Yang J, Liu Y, Zhu X, Liu Z, Zhang X (2012) A New Feature Selection based on Comprehensive Measurement both in Inter-category and Intra-category for Text Categorization. Information Processing & Management 48(4):741–754, DOI 10.1016/j.ipm.2011.12.005, URL http://linkinghub.elsevier.com/retrieve/pii/S030645731100118X
Yang Y, Pedersen JO (1997) A Comparative Study on Feature Selection in Text Categorization. In: ICML ’97 Proceedings of the Fourteenth International Conference on Machine Learning, pp 412–420
Yang Y, Yoo S, Lin F, Moon IC (2010) Personalized Email Prioritization Based on Content and Social Network Analysis. IEEE Intelligent Systems 25(4):12–18
Yeh Cy, Wu CH, Doong SH (2005) Effective Spam Classification based on Meta-Heuristics. In: Proceedings of IEEE International Conference on Systems, Man and Cybernetics, pp 3872–3877
Yerazunis WS (2003) Sparse Binary Polynomial Hashing and the CRM114 Discriminator Rough Guide to this Talk. In: MIT Spam Conference
Yerazunis WS (2004) The Spam-Filtering Accuracy Plateau at 99.9 percent Accuracy and How to Get Past It. In: MIT Spam Conference
Youn S, Mcleod D (2006) A Comparative Study for Email Classification. In: Proceedings of International Joint Conferences on Computer, Information, System Sciences, and Engineering (CISSE06), Bridgeport, CT
Yu B, Xu Zb (2008) A Comparative Study for Content-based Dynamic Spam Classification using Four Machine Learning Algorithms. Knowledge-Based Systems 21(4):355–362, DOI 10.1016/j.knosys.2008.01.001, URL http://linkinghub.elsevier.com/retrieve/pii/S0950705108000026
Zhang L, Zhu J, Yao T (2004) An Evaluation of Statistical Spam Filtering Techniques Spam Filtering as Text Categorization. ACM Transactions on Asian Language Information Processing (TALIP) 3(4):243–269
Zhu Y, Tan Y (2011) A Local-Concentration-Based Feature Extraction Approach for Spam Filtering. IEEE Transactions on Information Forensics and Security 6(2):486–497
Zhuang L, Dunagan J, Simon DR, Wang HJ, Tygar JD (2008) Characterizing Botnets from Email Spam Records. In: LEET 08: First USENIX Workshop on Large-Scale Exploits and Emergent Threat