Intelligent Spam Classification for Mobile Text Message

Kuruvilla Mathew
School of Engineering, Computing and Science
Swinburne University of Technology (Sarawak Campus)
Kuching, Malaysia
kmathew@swinburne.edu.my

Biju Issac
School of Engineering, Computing and Science
Swinburne University of Technology (Sarawak Campus)
Kuching, Malaysia
bissac@swinburne.edu.my
Abstract—This paper analyses intelligent spam filtering techniques in the SMS (Short Message Service) text paradigm, in the context of mobile text message spam. The unique characteristics of SMS content indicate that not all approaches may be equally effective or efficient. This paper compares some of the popular spam filtering techniques on a publicly available SMS spam corpus, to identify the methods that work best in the SMS text context. This can give hints on optimized spam detection for mobile text messages.
Keywords-SMS spam; Intelligent classification; Bayes Classifier; Mobile Spam

I. INTRODUCTION
Short messaging service (henceforth referred to as SMS) is an inseparable part of modern society, following the explosive penetration of mobile phones. Spammers take advantage of this fact and make use of SMS messages to reach potential customers to drive their business interests [1]. This issue is growing by the day, necessitating a mechanism for mobile SMS spam filtering. The mobile SMS spam filtering challenge is similar to email spam filtering, with the difference that an SMS can contain only a limited number of characters. Due to this limitation, almost all spam SMS texts follow a very similar pattern: they incorporate some “catch words” to attract potential “customers”, followed by some contact information, usually a call-back number, a reply SMS number, a URL (Uniform Resource Locator) to visit or, at the least, a keyword to search for [2].
The fact that the number of characters in each message is limited should make it possible for the search methods to produce better results. The spam filtering problem is essentially a case of text classification [1]. We will evaluate various algorithms used for spam filtering on an SMS spam corpus, in an attempt to identify the better methods so that they can be further optimized for the SMS text paradigm.
This paper is organized as follows. Section II discusses the related work in the areas of machine learning and its application to SMS text classification. Section III introduces the concept of spam and its consideration in the SMS text paradigm. Section IV presents information about the data collected for the experiments that enabled us to arrive at the empirical results. Section V describes the experiments and outcomes in detail. We discuss the conclusion and the prospective future work in the area under consideration in Section VI.
II. RELATED WORKS

SMS spam is a growing problem and is expected to become a sizable issue in the future. Some of the related work in the area is noted as follows.
Duan, Li and Huang have discussed a dual filtering approach that combines the KNN classification algorithm with rough set theory to separate spam from ham [2]. This was shown to improve classification speed while retaining high accuracy.
The Bayesian learning approach [3] was proposed by Zhang and Wang, making use of Bayesian learning theory and its application to the SMS paradigm for spam filtration. In this approach, word segmentation is carried out using ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System), after which the Bayesian classification method can be applied.
Paul Graham has discussed in detail the use of machine learning techniques in the context of spam detection [1]. He compares the classical approaches with machine learning techniques in battling spam.
Rick Allison and Peter Marsico proposed a method to filter spam SMS by including an SMS message discriminating module in the routing node, which was awarded USPTO No. 6,819,932 B2 on 16 November 2004 [4].
Freund and Schapire proposed AdaBoost, a method for boosting a weak learning algorithm, which can be applied to the classification of text in the machine learning paradigm [5].
The voted perceptron algorithm was proposed by Freund and Schapire as a simpler algorithm for linear classification, which takes advantage of data that are linearly separable with large margins [6].
McCallum and Nigam discussed calculating the probability of a document by multiplying the probabilities of the words that occur, with the understanding that the individual word occurrences are “events" and the document is a collection of word events; this is called the multinomial event model, and it is applied in the naive Bayes generative model [7].
Jun Liu et al. proposed applying the BM algorithm, a pattern matching algorithm developed by Boyer and Moore in 1977, to filter SMS text in a real-time filtering system [8]. One important feature of the BM algorithm is that it achieves a high level of execution efficiency using a leap match, which does not need to match each word.
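The leap-match idea can be illustrated with a minimal sketch of the Boyer-Moore bad-character rule; this is an illustrative reconstruction in Python, not the implementation used in [8], and the function name is our own:

```python
def bm_search(text, pattern):
    """Find the first index of `pattern` in `text` using the
    Boyer-Moore bad-character rule, or -1 if absent."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return -1 if m > n else 0
    # Last occurrence index of each character in the pattern.
    last = {c: i for i, c in enumerate(pattern)}
    s = 0  # current alignment of the pattern against the text
    while s <= n - m:
        j = m - 1
        # Compare right-to-left.
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        if j < 0:
            return s  # full match found at alignment s
        # Leap: shift so the mismatched text character aligns with
        # its last occurrence in the pattern, or past it entirely.
        s += max(1, j - last.get(text[s + j], -1))
    return -1
```

On a mismatch the alignment can jump several positions at once, which is what allows the filter to skip most character comparisons.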
The behavior-based social network and temporal (spectral) analysis proposed by Wang et al. is reported to detect spammers with high precision and recall [9].
One such method, proposed by Shirali-Shahreza and Shirali-Shahreza and by He et al., makes use of CAPTCHA (Completely Automatic Public Turing test to tell Computers and Humans Apart), which sends a response question, usually an image that requires a human to identify it, thus validating the message as a user-activated sequence and hence legitimate. This system can also be combined with a black and white list, where the CAPTCHA test is bypassed if the sender appears in either the black list or the white list [10], [11], [12].
A possible practical issue with the CAPTCHA approach is the assumption that an alternate technology like MMS (Multimedia Message Service) is available to make this possible.
III. INTRODUCTION TO SMS SPAM
A. What is spam?
The definition of spam does not vary much between email and SMS. In simple terms, it can be described as “Unsolicited Bulk Messages”. These are usually unwanted pieces of information pushed to users, appearing as advertisements, tricks and cheating information [13]. Spammers can be businessmen, and they send spam because it works, in the form of the responses they receive to their messages [1].
B. Spam Filtration
It is very easy for a person to identify a spam message just by reading through it. Our challenge in spam filtration is to solve this problem using fairly simple algorithms [1]. The most common classical approaches, which use white-lists and black-lists, do not work well, as they are only capable of blocking an entire server or source from sending messages, which can include legitimate messages as well (too many false positives). Hence the problem of spam filtration is essentially a case of text classification [1].
It is possible to look at the data to identify the keywords that get the best hits, but it is fairly easy for spammers to circumvent this using homophones, pinyin, variant words, etc. [2]. The most effective methods seen thus far are statistical machine learning approaches, which are able to “learn” the features of spam from training data. The advantage of these methods is that as spam evolves, they also adapt. It is noticed that the Bayesian approach is one of the more effective methods.
C. Introducing the SMS
The SMS or “short messaging service” offered by telecommunication companies allows users to communicate with each other using simple character-based messages of up to 160 characters (70 characters in the Chinese language) [2]. If more characters are sent, they are sent as separate messages.
The format of an SMS message is as follows:
sender: 0123456789
message: I am in a meeting now, can we meet after 4?
receiver: 3334445656
time-stamp: 12:30:45 2011.8.20
D. SMS Spam
Spam in the SMS context is very similar to email spam: typically, unsolicited bulk messaging with some business interest [1]. However, the SMS context restricts messages to a limited number of characters, comprising letters, numbers and a few symbols. This seriously restricts the amount and format of information that a spammer can send. A look through the messages reveals a very clear pattern, which can be attributed to this restriction. Almost all of the spam messages ask the users to call a number, reply by SMS or visit some URL to avail of some offer, product or saving. This pattern is confirmed by the result returned by a simple SQL query on the spam corpus (Section V-C), which achieved extremely high effectiveness using very few (only 10) feature words along with the presence of a number.
IV. DATA COLLECTION AND PROCESSING
The testing of the intelligent methods needs to be done on SMS messages so that a conclusion can be drawn on the topic under consideration. We made use of a fairly large collection of SMS from reference [14], with over 5000 separate text messages of which about 15% are spam messages.
The SMS Spam Collection v.1 is a public set of labeled SMS messages that have been collected for mobile phone spam research. It is a single collection of 5,574 real, non-encoded English messages, tagged according to whether they are legitimate (ham) or spam.
The algorithms we are interested in do not operate on raw strings, so we convert the data into feature vectors. We have used the Weka “StringToWordVector” function for this conversion.
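To illustrate what such a conversion produces, the following is a minimal bag-of-words sketch in Python, analogous in spirit to Weka's StringToWordVector; the function names and the simple alphanumeric tokenizer are our own assumptions, not Weka's API:

```python
import re

def build_vocabulary(messages):
    """Collect the sorted set of word tokens across all messages,
    mapped to feature indices."""
    vocab = sorted({t for msg in messages
                    for t in re.findall(r"[a-z0-9]+", msg.lower())})
    return {word: i for i, word in enumerate(vocab)}

def to_word_vector(message, vocab):
    """Convert one message into a word-count feature vector."""
    vec = [0] * len(vocab)
    for token in re.findall(r"[a-z0-9]+", message.lower()):
        if token in vocab:
            vec[vocab[token]] += 1
    return vec
```

Each SMS then becomes a fixed-length numeric vector that any of the classifiers below can consume.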
V. EVALUATION AND DISCUSSION
A. Spam Filtration Techniques
The spam filtration algorithms under consideration are listed in Table I. The methods were selected from the list of methods implemented in the Weka project [15]. Weka is a collection of machine learning algorithms for data mining tasks, issued as open source software under the GNU General Public License [15]. We apply the various algorithms in this collection to the SMS spam corpus and compare the effectiveness of each, in an attempt to identify the ones that perform better in the SMS paradigm.
TABLE I. LIST OF ALGORITHMS CONSIDERED

S. No  Algorithm                        Effectiveness
1.     Bayes Net                        97.22%
2.     Bayesian Logistic Regression     97.29%
3.     Complement Naive Bayes           95.43%
4.     DMNB Text                        97.14%
5.     Naive Bayes Multinomial          98.22%
6.     Naive Bayes                      93.50%
7.     Naive Bayes Updateable           93.50%
8.     Voted Perceptron                 97.00%
9.     Logistic                         93.43%
10.    Multilayer Perceptron            Fail
11.    RBF Network                      Fail
12.    Simple Logistic                  Fail
13.    SMO                              97.36%
14.    SPegasos                         97.07%
15.    Lazy IB1                         89.72%
16.    Lazy IBK                         94.29%
17.    Lazy KStar                       95.15%
18.    Lazy LWL                         Fail
19.    Ada Boost M1                     96.79%
20.    Attribute Selected Classifier    94.00%
21.    Classification via Clustering    75.37%
22.    Classification via Regression    91.29%
23.    CV Parameter Selection           86.58%
24.    Dagging                          93.29%
25.    Filtered Classifier              95.93%
26.    Logit Boost                      93.08%
27.    Multi Boost AB                   88.15%
28.    Raced Incremental Logit Boost    90.01%
29.    Conjunctive Rule                 90.58%
30.    Decision Table                   Fail
31.    JRip                             Fail
32.    NNge                             Fail

a. Fail means the algorithm did not complete execution, due to very long execution time or excessive memory usage.
b. The list of algorithms was selected from the implementations in Weka [15].
1) DMNB Text
DMNB Text (Discriminative Multinomial Naive Bayes) is a simple Bayesian classifier with discriminative parameter learning for text categorization. This is a class for building and using the discriminative multinomial naive Bayes classifier proposed by Su et al.: a simple, efficient and effective discriminative parameter learning method, called Discriminative Frequency Estimate (DFE), which learns parameters by discriminatively computing frequencies from data [17].
2) Bayes Net (SimpleEstimator + K2)
Bayes Net is a naive Bayesian learner [18]. This is a Bayes network learning algorithm using SimpleEstimator and K2. It is the base class for a Bayes network classifier and provides data structures (network structure, conditional probability distributions, etc.) and facilities common to Bayes network learning algorithms like K2 and B. SimpleEstimator is used for estimating the conditional probability tables of a Bayes network once the structure has been learned. K2 is a Bayes network learning algorithm that uses a hill-climbing algorithm restricted by an order on the variables.
3) Naive Bayes Multinomial
This is a class for building and using a multinomial naive Bayes classifier, based on the proposal by McCallum and Nigam [7], [15]. When calculating the probability of a document, one multiplies the probabilities of the words that occur, with the understanding that the individual word occurrences are “events" and the document is a collection of word events; this is called the multinomial event model. The naive Bayes classifier is the simplest of these models, in that it assumes that all attributes of the examples are independent of each other given the context of the class, called the “naive Bayes assumption".
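A minimal sketch of the multinomial event model with add-one (Laplace) smoothing may make this concrete; the class below and its toy token lists are an illustration under our own naming, not Weka's implementation:

```python
import math
from collections import Counter

class MultinomialNB:
    """Minimal multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        # docs: list of token lists; labels: parallel list of class names.
        self.classes = sorted(set(labels))
        self.priors = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for tokens, c in zip(docs, labels):
            self.counts[c].update(tokens)
        self.vocab = {t for counter in self.counts.values() for t in counter}
        return self

    def predict(self, tokens):
        # Score each class by log P(c) + sum of log P(word | c).
        def log_score(c):
            total = sum(self.counts[c].values())
            s = math.log(self.priors[c])
            for t in tokens:
                s += math.log((self.counts[c][t] + 1) /
                              (total + len(self.vocab)))
            return s
        return max(self.classes, key=log_score)
```

Working in log space avoids underflow when many word probabilities are multiplied, which matters even for short SMS texts.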
B. The Preferred Algorithms
As we expected, the Bayesian methods worked best in most cases. The best result was produced by Naive Bayes Multinomial, which gave a 98.22% correct classification and was also observed to be fast. However, our preferred algorithms are the Bayesian methods DMNB Text, which returned an effectiveness of 97.14% with 0 false positives, and Bayes Net, with an effectiveness of 97.22% and just 1 false positive, since marking a ham as spam is far more offensive than allowing a spam to come through as ham [16]. Based on these considerations, the preferred algorithms are listed in Table II. As can be seen, we have not necessarily chosen the top performing algorithms, giving preference instead to algorithms offering a combination of good effectiveness, low false positives and low execution times. We did not select the algorithm SMO (Sequential Minimal Optimization), even though it scored 97.36%, as it took more than 1 minute to complete execution.
TABLE II. PREFERRED ALGORITHMS

Rank  Algorithms                     %     Ham ✓  Ham ×  Spam ✓  Spam ×
1.    DMNB Text                      97.1  1213   0      148     40
2.    Bayes Net                      97.2  1212   1      150     38
3.    Naive Bayes Multinomial        98.2  1205   8      171     17
4.    Voted Perceptron               97.0  1206   7      153     35
5.    Ada Boost M1                   96.8  1208   5      148     40
6.    Lazy KStar                     95.1  1212   1      121     67
7.    J48 Trees                      95.9  1206   7      138     50
8.    Lazy IBK                       94.3  1206   7      115     73
9.    Attribute Selected Classifier  94.0  1205   8      112     76
10.   Bayesian Logistic Regression   97.3  1202   11     161     27

a. % - Percentage Score, ✓ - Correctly Classified, × - Incorrectly Classified
b. * - A simple deterministic SQL query
4) Voted Perceptron
This is the implementation of the voted perceptron
algorithm proposed by Freund and Schapire which globally
replaces all missing values and transforms nominal attributes
into binary ones [6]. It was proposed as a simpler algorithm for
linear classification which takes advantage of data that are
linearly separable with large margins. This is an algorithm for
linear classification which combines Rosenblatt’s perceptron
algorithm with Helmbold and Warmuth’s leave-one-out
method.
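The idea can be sketched as follows; this is a simplified Python illustration of the voted perceptron on dense feature vectors with labels in {-1, +1}, with our own function names, not the Weka implementation:

```python
def train_voted_perceptron(X, y, epochs=10):
    """Voted perceptron: keep every intermediate weight vector
    together with its survival count (its vote)."""
    d = len(X[0])
    w, c = [0.0] * d, 0          # current weight vector and its vote
    voted = []                   # list of (weights, votes) pairs
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * sum(wj * xj for wj, xj in zip(w, xi)) <= 0:
                voted.append((w[:], c))  # retire the current vector
                # Standard perceptron update on a mistake.
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                c = 1
            else:
                c += 1           # the vector survived one more example
    voted.append((w[:], c))
    return voted

def predict_voted(voted, x):
    """Sign of the vote-weighted sum of each retired vector's prediction."""
    s = sum(c * (1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else -1)
            for w, c in voted)
    return 1 if s > 0 else -1
```

Vectors that survived many examples carry more votes, which is what gives the method its large-margin behaviour on linearly separable data.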
5) Ada Boost M1
Boosting works by repeatedly running a given weak
learning algorithm on various distributions over the training
data, and then combining the classifiers produced by the weak
learner into a single composite classifier [5]. This is a class for
boosting a nominal class classifier using the AdaBoost M1
method, which can tackle only nominal class problems. This
often dramatically improves performance, but sometimes
overfits [15].
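A compact sketch of this boosting loop, using exhaustive one-feature threshold stumps as the weak learner, is shown below; this is an illustrative reconstruction under our own names and a deliberately tiny weak learner, not Weka's AdaBoostM1 code:

```python
import math

def stump_predict(stump, x):
    f, thr, sign = stump
    return sign if x[f] > thr else -sign

def train_adaboost(X, y, rounds=5):
    """AdaBoost.M1 sketch; labels in {-1, +1}; returns [(alpha, stump), ...]."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        best, best_err = None, 1.0
        # Exhaustively pick the weighted-error-minimising stump.
        for f in range(len(X[0])):
            for thr in sorted({xi[f] for xi in X}):
                for sign in (1, -1):
                    err = sum(wi for wi, xi, yi in zip(w, X, y)
                              if stump_predict((f, thr, sign), xi) != yi)
                    if err < best_err:
                        best, best_err = (f, thr, sign), err
        if best_err >= 0.5:
            break  # the weak learner is no better than chance
        alpha = 0.5 * math.log((1 - best_err) / max(best_err, 1e-10))
        ensemble.append((alpha, best))
        # Re-weight: emphasise the examples this stump got wrong.
        w = [wi * math.exp(-alpha * yi * stump_predict(best, xi))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, x):
    s = sum(a * stump_predict(st, x) for a, st in ensemble)
    return 1 if s > 0 else -1
```

The re-weighting step is the "various distributions over the training data" mentioned above: each round the hard examples gain weight, so the next weak classifier focuses on them.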
6) Lazy KStar
KStar (K*) is a lazy, instance-based classifier proposed by Cleary and Trigg [19]: the class of a test instance is based upon the class of those training instances similar to it, as determined by some similarity function. It differs from other instance-based learners in that it uses an entropy-based distance function.
7) J48 Trees
This method provides a class for generating a pruned or unpruned C4.5 decision tree, described in detail in the book by Ross Quinlan [20]. This method, combined with the unsupervised StringToWordVector function, was run under the filtered classifier option in Weka.
8) Lazy IBK
This is a K-nearest neighbours classifier which can select an appropriate value of K based on cross-validation and can also do distance weighting. It was proposed by Aha, Kibler and Albert, who discuss instance-based learning that generates classification predictions using only specific instances [21].
9) Attribute Selected Classifier
The dimensionality of training and test data is reduced by attribute selection before being passed on to a classifier. The J48 classifier [20] was used along with the evaluator CfsSubsetEval, which evaluates the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them, and best-first search, which searches the space of attribute subsets by greedy hill-climbing augmented with a backtracking facility.
10) Bayesian Logistic Regression
This algorithm implements Bayesian logistic regression for both Gaussian and Laplace priors, to avoid overfitting, and produces sparse predictive models for text data. It is aimed at the analysis of high-dimensional data, such as natural language text, and attempts to produce compact predictive models [22].
Figure 1. Analysis of the machine learning or evaluation methods
C. Observation – Using a Deterministic Query
Visually examining the corpus, we could notice the pattern mentioned above. In order to establish this, we did a simple search using a simple SQL query with only 10 keywords, listed below, together with the presence of a digit (number). We observed that this very simple query, with no optimization or token selection methods, gave a phenomenal success rate of 97.18%.

Success %: 97.18
Ham – Correctly Classified: 4787, Incorrect: 91 (1.9%)
Spam – Correctly Classified: 658, Incorrect: 67 (9.2%)
1) The SQL Query
A simple deterministic SQL query based on the pattern noticed is as follows. The SQL keywords used are:

Words: “call”, “text”, “txt”, “SMS”, “win”, “free”, “send”, “www”, “//”, “chat”

Digits: 0, 1, 3, 5, 7, 9. The digits 2, 4 and 8 were ignored, as these were used as variant words in ham messages in place of “to”, “for” and “ate”; furthermore, if a spam contains a call-back number, it will have more than one digit and should be caught by the presence of the other digits.

The query searches for the presence of these 10 keywords, manually identified from the corpus, along with a number, as we have noticed that SMS spam almost always includes a call-back number or address.
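For illustration, the deterministic rule can be sketched in Python as follows; the keyword and digit lists are taken from the description above, while the function itself is our reconstruction of the rule, not the actual SQL that was run:

```python
# The 10 keywords and 6 tell-tale digits described above.
SPAM_KEYWORDS = ["call", "text", "txt", "sms", "win", "free",
                 "send", "www", "//", "chat"]
SPAM_DIGITS = "013579"  # 2, 4, 8 skipped: used as variants of "to", "for", "ate"

def looks_like_spam(message):
    """Flag a message as spam when it contains at least one of the
    keywords together with at least one tell-tale digit."""
    lower = message.lower()
    has_keyword = any(k in lower for k in SPAM_KEYWORDS)
    has_digit = any(d in lower for d in SPAM_DIGITS)
    return has_keyword and has_digit
```

The conjunction of a keyword and a digit is what keeps the false-positive rate low: ordinary ham rarely combines both in one short message.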
Figure 2. Comparison of the preferred successful algorithms
VI. CONCLUSION AND FUTURE WORK
The Bayesian methods were very effective, just as we expected, giving very high success percentages of up to 98%. This indicates that the Bayesian technique is one of the best approaches towards spam filtration, and a good candidate for optimization towards better performance in the SMS text context. However, the fact that the success percentage of the deterministic SQL query ranks among the top 5 intelligent methods indicates the possibility that SMS spam is very balanced data with a clear pattern, which should make the search much simpler and faster.

However, the AI methods face the challenge that they are very process intensive and also require more memory in order to store the learning data. The SQL query giving us a positive result indicates a possibility of optimizing the Bayesian methods towards better effectiveness, efficiency and simplicity, by applying tokenization methods adapted to the SMS paradigm along with possible keys for identifying a call-back reference.
The current machine learning schemes are too resource intensive to be applied on a client device, which may typically be a smart phone. If the simplicity of the SMS pattern allows us to simplify the algorithms towards improving efficiency along with effectiveness, this can facilitate deployment on low-power client devices. This approach would allow the algorithms to learn data personalized to each user, with the learning focused on individual users rather than implemented on a server that generates generalised results for all users.
ACKNOWLEDGMENT

We would like to thank Amir Mohammad Shahi from Swinburne University of Technology (Sarawak Campus) for his help in the work with the Weka tool. We also thank Tiago and José for the SMS corpus [14].
REFERENCES

[1] Paul Graham (August 2002), “A plan for spam”, viewed: 28 September 2011, <http://paulgraham.com/spam.html>
[2] Duan, L., Li, N., & Huang, L. (2009). “A new spam short message classification”, 2009 First International Workshop on Education Technology and Computer Science, 168-171.
[3] Zhang, H.-yan, & Wang, W. (2009). “Application of Bayesian method to spam SMS filtering”, 2009 International Conference on Information Engineering and Computer Science, 1-3.
[4] Rick L. Allison & Peter J. Marsico, US Patent No. 6,819,932, “Methods and systems for preventing delivery of unwanted short message service (SMS) messages” (Nov 2004).
[5] Freund, Y., Schapire, R. E., & Hill, M. (1996). “Experiments with a new boosting algorithm”, Thirteenth International Conference on Machine Learning, San Francisco, 148-156.
[6] Freund, Y., & Schapire, R. E. (1998). “Large margin classification using the perceptron algorithm”, Proceedings of the eleventh annual conference on Computational learning theory - COLT ’98, 296, 209-217.
[7] McCallum, A., & Nigam, K. (1998). “A comparison of event models for naive Bayes text classification”, AAAI-98 Workshop on ‘Learning for Text Categorization’.
[8] Liu, J., Ke, H., & Zhang, G. (2010). “Real-time SMS filtering system based on BM algorithm”, System, 6-8.
[9] Wang, C. et al. (2010), “A behavior-based SMS antispam system”, IBM Journal of Research and Development, 3:1-3:16.
[10] Shirali-Shahreza, M. H., & Shirali-Shahreza, M. (2008). “An anti-SMS-spam using CAPTCHA”, 2008 ISECS International Colloquium on Computing, Communication, Control, and Management, 318-321.
[11] He, P., Sun, Y., Zheng, W., & Wen, X. (2008). “Filtering short message spam of group sending using CAPTCHA”, First International Workshop on Knowledge Discovery and Data Mining (WKDD 2008), 558-561.
[12] He, P. (2008). “A Novel Method for Filtering Group Sending Short Message Spam”, Proofs, 60-65.
[13] Cai, J., Tang, Y., & Hu, R. (2008). “Spam filter for short messages using winnow”, 2008 International Conference on Advanced Language Processing and Web Information Technology, 454-459.
[14] SMS Spam Collection v.1, viewed: 9 August 2011, <www.dt.fee.unicamp.br/~tiago/SMSspamcollection>
[15] The University of Waikato, Weka 3: Data Mining Software in Java, viewed: 14 September 2011, <http://www.cs.waikato.ac.nz/ml/weka/>
[16] Cormack, G. V., Hidalgo, J. M. G., & Sánz, E. P. (2007). “Feature engineering for mobile (SMS) spam filtering”, Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR ’07, 871.
[17] Su, J., Zhang, H., Ling, C. X., & Matwin, S. (2008). “Discriminative parameter learning for Bayesian networks”, Proceedings of the 25th international conference on Machine learning - ICML ’08, 1016-1023.
[18] Bayesian Network Classifiers in Weka, viewed: 14 September 2011, <http://www.cs.waikato.ac.nz/~remco/weka.bn.pdf>
[19] Cleary, J. G., & Trigg, L. E. “K*: An Instance-based Learner Using an Entropic Distance Measure”, 12th International Conference on Machine Learning, 108-114.
[20] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
[21] D. Aha, D. Kibler, M. Albert (1991). “Instance-based learning algorithms”, Machine Learning, Kluwer Academic Publishers, 6:37-66.
[22] Alexander Genkin, David D. Lewis, David Madigan (2004). “Large-scale Bayesian logistic regression for text categorization”, Technometrics, August 1, 2007, 49(3): 291-304.