Intelligent Spam Classification for Mobile Text Message

Kuruvilla Mathew
School of Engineering, Computing and Science
Swinburne University of Technology (Sarawak Campus)
Kuching, Malaysia
kmathew@swinburne.edu.my

Biju Issac
School of Engineering, Computing and Science
Swinburne University of Technology (Sarawak Campus)
Kuching, Malaysia
bissac@swinburne.edu.my

Abstract—This paper analyses intelligent spam filtering techniques in the SMS (Short Message Service) text paradigm, in the context of mobile text message spam. The unique characteristics of SMS content indicate that not all approaches will be equally effective or efficient. This paper compares some of the popular spam filtering techniques on a publicly available SMS spam corpus, to identify the methods that work best in the SMS text context. This can give hints on optimized spam detection for mobile text messages.

Keywords—SMS spam; Classifier; Mobile Spam; Intelligent classification; Bayes

I. INTRODUCTION

Short messaging service (henceforth referred to as SMS) is an inseparable part of modern society, following the explosive penetration of mobile phones. Spammers take advantage of this fact and use SMS messages to reach potential customers and drive their business interests [1]. This issue is growing by the day, thereby necessitating a mechanism for mobile SMS spam filtering. The mobile SMS spam filtering challenge is similar to email spam filtering, with the difference that a single message can carry only a limited number of characters. Due to this limitation, almost all spam SMS texts follow a very close pattern. A typical message incorporates some "catch words" to attract potential "customers", followed by contact information: usually a call-back number, a reply SMS number, a URL (Uniform Resource Locator) to visit or, at the least, a keyword to search for [2]. The fact that the number of characters in each message is limited should make it possible for search methods to produce better results.

The spam filtering problem essentially is a case of text classification [1]. We evaluate various algorithms used for spam filtering on an SMS spam corpus in an attempt to identify the better methods, so that they can be further optimized for the SMS text paradigm.

This paper is organized as follows. Section II discusses the related work in the areas of machine learning and its application to SMS text classification. Section III introduces the concept of spam and its consideration in the SMS text paradigm. Section IV presents information about the data collected for the experiments that enabled us to arrive at the empirical results. Section V describes the experiments and outcomes in detail. We discuss the conclusion and prospective future work in the area under consideration in Section VI.

II. RELATED WORKS

SMS spam is a growing problem and is expected to be a sizable issue in the future. Some of the related work in the area is noted as follows.

Duan, Li and Huang discussed a dual filtering approach combining the KNN classification algorithm with rough sets to separate spam from ham [2]. This was shown to improve the speed of classification while retaining high accuracy.

The Bayesian learning approach [3] was proposed by Zhang and Wang, making use of Bayesian learning theory and its application to the SMS paradigm for spam filtering.
In this approach, word segmentation is carried out using ICTCLAS (Institute of Computing Technology Lexical Analysis System), after which the Bayesian classification method is applied.

Paul Graham has discussed in detail the use of machine learning techniques in the context of spam detection, comparing the classical approaches against machine learning techniques in battling spam [1].

Rick Allison and Peter Marsico's proposal of a method to filter spam SMS by including an SMS message discriminating module in the routing node was awarded USPTO No. 6,819,932 B2 (on 16th of Nov 2004) [4].

Freund and Schapire proposed AdaBoost, a method for boosting a weak learning algorithm, which can be applied to the classification of text in the machine learning paradigm [5]. The voted perceptron algorithm was also proposed by Freund and Schapire as a simpler algorithm for linear classification which takes advantage of data that are linearly separable with large margins [6].

McCallum and Nigam discussed calculating the probability of a document by multiplying the probabilities of the words that occur in it, treating the individual word occurrences as "events" and the document as a collection of word events. This is called the multinomial event model, and it underlies one of the Naive Bayes generative models [7].

Jun Liu et al. proposed a real-time SMS filtering system based on the BM algorithm, a pattern matching algorithm developed by Boyer and Moore in 1977 [8]. One important feature of the BM algorithm is that it achieves a high level of execution efficiency using a leap match, which does not need to match each word.

The behavior-based social network and temporal (spectral) analysis proposed by Wang et al. is reported to detect spammers with high precision and recall [9].

Another class of methods, proposed by Shirali-Shahreza and Shirali-Shahreza and by Peizhou He et al., makes use of CAPTCHA (Completely Automatic Public Turing test to tell Computers and Humans Apart): the system sends back a challenge question, usually an image that requires a human to identify it, thus validating the message as a user-initiated sequence and hence legitimate. This system can also be combined with black and white lists, where the CAPTCHA test is bypassed if the sender already appears in either list [10], [11], [12]. A possible practical issue with the CAPTCHA approach is the assumption that an alternate technology such as MMS (Multimedia Message Service) is available to deliver the challenge.

III. INTRODUCTION TO SMS SPAM

A. What is a spam?

The definition of spam does not vary much between email and SMS. In simple terms, it can be described as "Unsolicited Bulk Messages". These are usually unwanted pieces of information pushed to users, taking the appearance of advertisements, tricks and cheating information [13]. The spammers can be businessmen, and they send spam because it works, in the form of the responses they receive to their messages [1].

B. Spam Filtration

It is very easy for anyone to identify a spam message just by reading through it. Our challenge in spam filtration is to solve this problem using fairly simple algorithms [1]. The most common classical approaches, which use white-lists and black-lists, do not work well, as they are only capable of blocking an entire server or source from sending messages, which can include legitimate messages as well (too many false positives).
Hence the problem of spam filtration is essentially a case of text classification [1]. It is possible to look at the data to identify the keywords that get the best hits, but it is fairly easy for spammers to circumvent this using methods such as homophones, pinyin and variant words [2]. The most effective method seen thus far is statistical machine learning, which is able to "learn" the features of a spam message from training data. The advantage of this method is that as spam evolves, these methods also adapt. It is noticed that the Bayesian approach is one of the more effective methods.

C. Introducing the SMS

The SMS or "short messaging service" offered by telecommunication companies allows users to communicate with each other using simple character-based messages of 160 characters (70 characters in the Chinese language) [2]. If more characters are entered, they are sent as separate messages. The format of an SMS message is as follows:

sender: 0123456789
message: I am in a meeting now, can we meet after 4?
receiver: 3334445656
time-stamp: 12:30:45 2011.8.20

D. SMS Spam

Spam in the SMS context is very similar to email spam: typically, unsolicited bulk messaging with some business interest [1]. However, the SMS context restricts messages to a limited number of characters, covering letters, numbers and a few symbols. This seriously restricts the amount and format of information that a spammer can send. A look through the messages reveals a very clear pattern, which should be attributed to this restriction: almost all of the spam messages ask the user to call a number, reply by SMS or visit some URL to avail themselves of some offer, product or saving. This pattern is confirmed by the result returned by a simple SQL query on the spam corpus (Section V-C), which achieved extremely high effectiveness using very few (only 10) feature words along with the presence of a number.

IV. DATA COLLECTION AND PROCESSING

The testing of the intelligent methods needs to be done on SMS messages so that a conclusion can be drawn on the topic in consideration. We made use of a fairly large collection of SMS messages from reference [14], with over 5000 separate text messages, of which about 15% are spam. The SMS Spam Collection v.1 is a public set of labeled SMS messages that have been collected for mobile phone spam research. It is a single collection of 5,574 real, non-encoded English messages, tagged as either legitimate (ham) or spam. The algorithms we are interested in do not read strings, so we convert the data into feature vectors. We used the Weka "StringToWordVector" function for this conversion.

V. EVALUATION AND DISCUSSION

A. Spam Filtration Techniques

The spam filtration algorithms in consideration for this observation are listed in Table I. The methods were selected from the implementations in the Weka project [15]. Weka is a collection of machine learning algorithms for data mining tasks, issued as open source software under the GNU General Public License [15]. We apply the various algorithms from this collection to the SMS spam corpus and compare the effectiveness of each, in an attempt to identify the ones that perform better in the SMS paradigm.
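As an illustration of this experimental setup, the sketch below loads a corpus, applies the Weka StringToWordVector filter described in Section IV, and cross-validates one of the classifiers from Table I. This is a minimal sketch, not the authors' code: the file name sms_corpus.arff, the attribute layout and the use of 10-fold cross-validation are our assumptions, as the paper does not state its exact evaluation settings.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class SmsSpamExperiment {
    public static void main(String[] args) throws Exception {
        // Load the labeled corpus: assumed to be an ARFF file with one
        // string attribute (the message text) and a nominal class {ham, spam}.
        Instances raw = DataSource.read("sms_corpus.arff");
        raw.setClassIndex(raw.numAttributes() - 1);

        // Convert each message string into a sparse word-count vector,
        // since these classifiers cannot read raw strings.
        StringToWordVector s2wv = new StringToWordVector();
        s2wv.setInputFormat(raw);
        Instances data = Filter.useFilter(raw, s2wv);

        // Evaluate the top scorer from Table I on the vectorized data.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayesMultinomial(), data,
                                10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.printf("Correctly classified: %.2f%%%n", eval.pctCorrect());
    }
}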
TABLE I. LIST OF ALGORITHMS CONSIDERED

S. No  Algorithm                        Effectiveness
1.     Bayes Net                        97.22%
2.     Bayesian Logistic Regression     97.29%
3.     Complement Naive Bayes           95.43%
4.     DMNB Text                        97.14%
5.     Naive Bayes Multinomial          98.22%
6.     Naive Bayes                      93.50%
7.     Naive Bayes Updateable           93.50%
8.     Voted Perceptron                 97.00%
9.     Logistic                         93.43%
10.    Multilayer Perceptron            Fail
11.    RBF Network                      Fail
12.    Simple Logistic                  Fail
13.    SMO                              97.36%
14.    SPegasos                         97.07%
15.    Lazy IB1                         89.72%
16.    Lazy IBk                         94.29%
17.    Lazy KStar                       95.15%
18.    Lazy LWL                         Fail
19.    Ada Boost M1                     96.79%
20.    Attribute Selected Classifier    94.00%
21.    Classification via Clustering    75.37%
22.    Classification via Regression    91.29%
23.    CV Parameter Selection           86.58%
24.    Dagging                          93.29%
25.    Filtered Classifier              95.93%
26.    Logit Boost                      93.08%
27.    Multi Boost AB                   88.15%
28.    Raced Incremental Logit Boost    90.01%
29.    Conjunctive Rule                 90.58%
30.    Decision Table                   Fail
31.    JRip                             Fail
32.    NNge                             Fail

a. Fail means the run did not complete execution due to very long execution time or excessive memory usage.
b. The list of algorithms was selected from the implementations in Weka [15].

B. The Preferred Algorithms

As we expected, the Bayesian methods worked best in most cases. The best result was produced by Naive Bayes Multinomial, which gave 98.22% correct classification and was also observed to be fast. However, our preferred algorithms are the Bayesian methods DMNB Text, which returned an effectiveness of 97.14% with 0 false positives, and Bayes Net, with an effectiveness of 97.22% and just 1 false positive, since marking a ham as spam is far more offensive than allowing a spam to come through as ham [16]. Based on these considerations, the preferred algorithms are listed in Table II. As can be seen, these are not necessarily the top performing algorithms; we gave preference to algorithms offering a combination of good effectiveness, low false positives and low execution times. We did not select the algorithm SMO (Sequential Minimal Optimization), even though it scored 97.36%, as it took more than 1 minute to complete execution.

TABLE II. PREFERRED ALGORITHMS

Rank  Algorithm                        %     Ham ✓  Ham ×  Spam ✓  Spam ×
1.    DMNB Text                        97.1  1213   0      148     40
2.    Bayes Net                        97.2  1212   1      150     38
3.    Naive Bayes Multinomial          98.2  1205   8      171     17
4.    Voted Perceptron                 97.0  1206   7      153     35
5.    Ada Boost M1                     96.8  1208   5      148     40
6.    Lazy KStar                       95.1  1212   1      121     67
7.    J48 Trees                        95.9  1206   7      138     50
8.    Lazy IBk                         94.3  1206   7      115     73
9.    Attribute Selected Classifier    94.0  1205   8      112     76
10.   Bayesian Logistic Regression     97.3  1202   11     161     27

a. % - Percentage Score, ✓ - Correctly Classified, × - Incorrectly Classified
b. * - A simple deterministic SQL query

1) DMNB Text

DMNB Text (Discriminative Multinomial Naive Bayes) is a simple Bayesian classifier with discriminative parameter learning for text categorization. It is a class for building and using the discriminative multinomial naive Bayes classifier proposed by Su et al.: a simple, efficient and effective discriminative parameter learning method, called Discriminative Frequency Estimate (DFE), which learns parameters by discriminatively computing frequencies from the data [17].

2) Bayes Net (SimpleEstimator + K2)

Bayes Net is a Bayesian network learner [18]. This is a Bayes network learning algorithm using the simple estimator and K2. It is the base class for a Bayes network classifier and provides the data structures (network structure, conditional probability distributions, etc.) and facilities common to Bayes network learning algorithms such as K2 and B. SimpleEstimator is used for estimating the conditional probability tables of a Bayes network once the structure has been learned. K2 is a Bayes network learning algorithm that uses hill climbing restricted by an order on the variables.

3) Naive Bayes Multinomial

This is a class for building and using a multinomial naive Bayes classifier, based on the proposal by McCallum and Nigam [7], [15]. When calculating the probability of a document, one multiplies the probabilities of the words that occur in it, treating the individual word occurrences as "events" and the document as the collection of word events; this is called the multinomial event model. The naive Bayes classifier is the simplest of these models, in that it assumes all attributes of the examples are independent of each other given the class, called the "naive Bayes assumption".
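To make the multinomial event model concrete, the scoring rule it implies can be written out as follows (the notation is ours, following McCallum and Nigam [7]):

\[
P(c \mid d) \;\propto\; P(c) \prod_{t=1}^{|V|} P(w_t \mid c)^{n_t(d)}
\]

where $V$ is the vocabulary and $n_t(d)$ counts the occurrences of word $w_t$ in message $d$. With Laplace smoothing, the per-class word probabilities are estimated from the training counts as

\[
\hat{P}(w_t \mid c) = \frac{1 + \sum_{d \in D_c} n_t(d)}{|V| + \sum_{s=1}^{|V|} \sum_{d \in D_c} n_s(d)}
\]

where $D_c$ is the set of training messages labeled $c$. A message is classified as spam when the spam class attains the larger score.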
4) Voted Perceptron

This is the implementation of the voted perceptron algorithm proposed by Freund and Schapire; it globally replaces all missing values and transforms nominal attributes into binary ones [6]. It was proposed as a simpler algorithm for linear classification which takes advantage of data that are linearly separable with large margins. The algorithm combines Rosenblatt's perceptron algorithm with Helmbold and Warmuth's leave-one-out method.

5) Ada Boost M1

Boosting works by repeatedly running a given weak learning algorithm on various distributions over the training data, and then combining the classifiers produced by the weak learner into a single composite classifier [5]. This is a class for boosting a nominal class classifier using the AdaBoost M1 method, which can tackle only nominal class problems. It often dramatically improves performance, but sometimes overfits [15].

6) Lazy KStar

KStar (K*) is a lazy, instance-based classifier proposed by Cleary and Trigg [19]: the class of a test instance is based upon the class of those training instances similar to it, as determined by some similarity function. It differs from other instance-based learners in that it uses an entropy-based distance function.

7) J48 Trees

This method provides a class for generating a pruned or unpruned C4.5 decision tree, described in detail in the book by Ross Quinlan [20]. This method, combined with the unsupervised StringToWordVector function, was run under the filtered classifier option in Weka (see the sketch following these descriptions).

8) Lazy IBk

This is a K-nearest neighbours classifier which can select an appropriate value of K based on cross-validation and can also do distance weighting. It was proposed by Aha, Kibler and Albert, who discuss instance-based learning that generates classification predictions using only specific instances [21].

9) Attribute Selected Classifier

The dimensionality of the training and test data is reduced by attribute selection before being passed on to a classifier. The J48 classifier [20] was used along with the evaluator CfsSubsetEval, which evaluates the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them, and best-first search, which searches the space of attribute subsets by greedy hill climbing augmented with a backtracking facility.

10) Bayesian Logistic Regression

This algorithm implements Bayesian logistic regression for both Gaussian and Laplace priors to avoid overfitting, and produces sparse predictive models for text data. It targets the analysis of high-dimensional data such as natural language text, attempting to produce compact predictive models [22].
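As referenced in the J48 description above, a minimal sketch of Weka's filtered classifier setup follows. FilteredClassifier chains StringToWordVector to the tree learner, so raw string instances can be classified directly; the file and variable names are our assumptions.

import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class FilteredJ48 {
    public static void main(String[] args) throws Exception {
        Instances raw = DataSource.read("sms_corpus.arff"); // hypothetical file
        raw.setClassIndex(raw.numAttributes() - 1);

        // The filter is applied internally at both train and test time,
        // so the word dictionary is built from the training data only.
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new StringToWordVector());
        fc.setClassifier(new J48()); // pruned C4.5 decision tree [20]
        fc.buildClassifier(raw);

        // Predict the class of the first message in the set.
        double label = fc.classifyInstance(raw.instance(0));
        System.out.println(raw.classAttribute().value((int) label));
    }
}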
Figure 1. Analysis of the machine learning or evaluation methods

C. Observation – Using a Deterministic Query

Visually examining the corpus, we could notice the pattern mentioned above. In order to establish this, we ran a simple search using a simple SQL query with only 10 keywords, listed below, together with the presence of a digit (number). We observed that this very simple query, with no optimization or token selection methods, gave a phenomenal success rate of 97.18%.

Success %: 97.18
Ham – Correctly Classified: 4787, Incorrect: 91 (1.9%)
Spam – Correctly Classified: 658, Incorrect: 67 (9.2%)

1) The SQL Query

A simple deterministic SQL query based on the pattern noticed was built from the following terms:

Words: "call", "text", "txt", "SMS", "win", "free", "send", "www", "//", "chat"
Digits: 0, 1, 3, 5, 7, 9

The digits 2, 4 and 8 were ignored, as these are used as variant words in ham messages in place of "to", "for" and "ate"; moreover, if a spam message includes a call-back number, it will contain more than one digit and should be caught by the presence of the other digits. The query searches for the presence of these 10 keywords, manually identified from the corpus, along with a number, as we noticed that SMS spam almost always includes a call-back number or address.
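The exact SQL used in the paper is not reproduced here; as an illustration, the same deterministic rule can be re-expressed in a few lines of Java. This is a sketch under the assumption that a message is flagged as spam when it contains at least one of the 10 keywords and at least one of the digits 0, 1, 3, 5, 7, 9.

import java.util.List;

public class KeywordRule {
    private static final List<String> KEYWORDS = List.of(
            "call", "text", "txt", "sms", "win",
            "free", "send", "www", "//", "chat");
    private static final String DIGITS = "013579";

    public static boolean isSpam(String message) {
        String m = message.toLowerCase();
        // Flag only when a keyword AND a significant digit co-occur.
        boolean hasKeyword = KEYWORDS.stream().anyMatch(m::contains);
        boolean hasDigit = m.chars().anyMatch(c -> DIGITS.indexOf(c) >= 0);
        return hasKeyword && hasDigit;
    }

    public static void main(String[] args) {
        // Spam-like message: keywords "win", "free", "call" plus digits.
        System.out.println(isSpam("WIN a FREE trip! Call 09061701461 now"));
        // Ham: no keyword, and the digit 4 is deliberately ignored.
        System.out.println(isSpam("I am in a meeting now, can we meet after 4?"));
    }
}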
Figure 2. Comparison of the preferred successful algorithms

VI. CONCLUSION AND FUTURE WORK

The Bayesian methods were very effective, just as we expected, giving very high success percentages of up to 98%. This indicates that the Bayesian technique is one of the best approaches towards spam filtration, and a strong candidate for optimization towards better performance in the SMS text context. However, the fact that the success percentage of the deterministic SQL query ranks among the top 5 intelligent methods indicates that SMS spam may be very balanced data with a clear pattern, which should allow the search to be much simpler and faster. The machine learning methods, on the other hand, face the challenge that they are very process intensive and also require more memory in order to store the learning data. The positive result of the SQL query indicates the possibility of optimizing the Bayesian methods towards better effectiveness, efficiency and simplicity, by applying tokenization methods adapted to the SMS paradigm along with possible keys for identifying a call-back reference.

The current machine learning schemes are too machine intensive to be applied on a client device, which may typically be a smart phone. If the simplicity of the SMS pattern allows us to simplify the algorithms towards improved efficiency along with effectiveness, this can facilitate deployment on low-power client devices. Such an approach would allow the algorithms to learn data personalized to each user, with the learning focused on individual users rather than implemented on a server that generates generalised results for all users.

ACKNOWLEDGMENT

We would like to thank Amir Mohammad Shahi from Swinburne University of Technology Sarawak Campus for his help in the work with the Weka tool. We also thank Tiago and José for the SMS corpus [14].

REFERENCES

[1] Paul Graham (August 2002). "A plan for spam", viewed 28 September 2011, <http://paulgraham.com/spam.html>
[2] Duan, L., Li, N., & Huang, L. (2009). "A new spam short message classification". 2009 First International Workshop on Education Technology and Computer Science, 168-171.
[3] Zhang, H., & Wang, W. (2009). "Application of Bayesian method to spam SMS filtering". 2009 International Conference on Information Engineering and Computer Science, 1-3.
[4] Rick L. Allison, & Peter J. Marsico. US Patent No. 6,819,932 B2, "Methods and systems for preventing delivery of unwanted short message service (SMS) messages" (Nov 2004).
[5] Freund, Y., & Schapire, R. E. (1996). "Experiments with a new boosting algorithm". Thirteenth International Conference on Machine Learning, San Francisco, 148-156.
[6] Freund, Y., & Schapire, R. E. (1998). "Large margin classification using the perceptron algorithm". Proceedings of the Eleventh Annual Conference on Computational Learning Theory - COLT '98, 209-217.
[7] McCallum, A., & Nigam, K. (1998). "A comparison of event models for naive Bayes text classification". AAAI-98 Workshop on Learning for Text Categorization.
[8] Liu, J., Ke, H., & Zhang, G. (2010). "Real-time SMS filtering system based on BM algorithm", 6-8.
[9] Wang, C., et al. (2010). "A behavior-based SMS antispam system". IBM Journal of Research and Development, 3:1-3:16.
[10] Shirali-Shahreza, M. H., & Shirali-Shahreza, M. (2008). "An anti-SMS-spam using CAPTCHA". 2008 ISECS International Colloquium on Computing, Communication, Control, and Management, 318-321.
[11] He, P., Sun, Y., Zheng, W., & Wen, X. (2008). "Filtering short message spam of group sending using CAPTCHA". First International Workshop on Knowledge Discovery and Data Mining (WKDD 2008), 558-561.
[12] He, P. (2008). "A novel method for filtering group sending short message spam", 60-65.
[13] Cai, J., Tang, Y., & Hu, R. (2008). "Spam filter for short messages using Winnow". 2008 International Conference on Advanced Language Processing and Web Information Technology, 454-459.
[14] SMS Spam Collection v.1, viewed 9 August 2011, <www.dt.fee.unicamp.br/~tiago/SMSspamcollection>
[15] The University of Waikato. Weka 3: Data Mining Software in Java, viewed 14 September 2011, <http://www.cs.waikato.ac.nz/ml/weka/>
[16] Cormack, G. V., Hidalgo, J. M. G., & Sánz, E. P. (2007). "Feature engineering for mobile (SMS) spam filtering". Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR '07, 871.
[17] Su, J., Zhang, H., Ling, C. X., & Matwin, S. (2008). "Discriminative parameter learning for Bayesian networks". Proceedings of the 25th International Conference on Machine Learning - ICML '08, 1016-1023.
[18] Bayesian Network Classifiers in Weka, viewed 14 September 2011, <http://www.cs.waikato.ac.nz/~remco/weka.bn.pdf>
[19] Cleary, J. G., & Trigg, L. E. (1995). "K*: An instance-based learner using an entropic distance measure". 12th International Conference on Machine Learning, 108-114.
[20] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
[21] Aha, D., Kibler, D., & Albert, M. (1991). "Instance-based learning algorithms". Machine Learning, Kluwer Academic Publishers, 6:37-66.
[22] Genkin, A., Lewis, D. D., & Madigan, D. (2007). "Large-scale Bayesian logistic regression for text categorization". Technometrics, 49(3): 291-304.