DOI: 10.1145/1166160.1166191
Article

Content based SMS spam filtering

Published: 10 October 2006

    Abstract

    In recent years, we have witnessed a dramatic increase in the volume of spam email. Other related forms of spam are increasingly proving to be a significant problem, especially spam on Instant Messaging services (so-called SPIM) and Short Message Service (SMS) or mobile spam. Like email spam, the SMS spam problem can be approached with legal, economic, or technical measures. Among the wide range of technical measures, Bayesian filters play a key role in stopping email spam. In this paper, we analyze to what extent the Bayesian filtering techniques used to block email spam can be applied to the problem of detecting and stopping mobile spam. In particular, we have built two SMS spam test collections of significant size, in English and Spanish. We have tested a number of message representation techniques and Machine Learning algorithms on them in terms of effectiveness. Our results demonstrate that Bayesian filtering techniques can be effectively transferred from email to SMS spam.
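
As a rough illustration of the kind of Bayesian filter the abstract refers to, the sketch below trains a multinomial naive Bayes classifier on word tokens with Laplace smoothing. It is not the authors' implementation: the tokenizer, the smoothing constant `alpha`, the class name `NaiveBayesFilter`, the helper `build_toy_filter`, and the four toy messages are all assumptions made for this example, and the paper's own representations and learners may differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a message and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesFilter:
    """Multinomial naive Bayes over word tokens (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (assumed value)
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()
        self.vocab = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def log_score(self, text, label):
        # log P(label) + sum of log P(token | label).
        # Assumes each class has at least one training message.
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        counts = self.word_counts[label]
        denom = sum(counts.values()) + self.alpha * len(self.vocab)
        for tok in tokenize(text):
            if tok in self.vocab:  # tokens never seen in training are skipped
                score += math.log((counts[tok] + self.alpha) / denom)
        return score

    def classify(self, text):
        return max(("spam", "ham"), key=lambda lbl: self.log_score(text, lbl))


# Hypothetical toy messages, not taken from the paper's test collections.
TOY_DATA = [
    ("WIN a free prize now, text CLAIM to 80000", "spam"),
    ("Free ringtone! Reply YES to claim", "spam"),
    ("Are we still meeting for lunch today?", "ham"),
    ("Call me when you get home", "ham"),
]


def build_toy_filter():
    """Train a filter on the toy messages above."""
    nb = NaiveBayesFilter()
    for text, label in TOY_DATA:
        nb.train(text, label)
    return nb


if __name__ == "__main__":
    nb = build_toy_filter()
    print(nb.classify("claim your free prize"))
    print(nb.classify("lunch at home today"))
```

Because SMS messages are short, each one yields very few tokens, which is one reason alternative message representations are worth comparing in this setting.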

    Published In

    DocEng '06: Proceedings of the 2006 ACM symposium on Document engineering
    October 2006
    232 pages
    ISBN: 1595935150
    DOI: 10.1145/1166160
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. bayesian filter
    2. junk
    3. receiver operating characteristic
    4. spam

    Qualifiers

    • Article

    Conference

    DocEng06: ACM Symposium on Document Engineering
    October 10 - 13, 2006
    Amsterdam, The Netherlands

    Acceptance Rates

    Overall Acceptance Rate 178 of 537 submissions, 33%

    Cited By

    • (2024) Korean Voice Phishing Detection Applying NER With Key Tags and Sentence-Level N-Gram. IEEE Access 12, 52951-52962. DOI: 10.1109/ACCESS.2024.3387027. Online publication date: 2024
    • (2024) Investigating Evasive Techniques in SMS Spam Filtering: A Comparative Analysis of Machine Learning Models. IEEE Access 12, 24306-24324. DOI: 10.1109/ACCESS.2024.3364671. Online publication date: 2024
    • (2024) An optimal feature selection method for text classification through redundancy and synergy analysis. Multimedia Tools and Applications. DOI: 10.1007/s11042-024-19736-1. Online publication date: 28-Jun-2024
    • (2024) Semantic similarity-aware feature selection and redundancy removal for text classification using joint mutual information. Knowledge and Information Systems. DOI: 10.1007/s10115-024-02143-1. Online publication date: 13-Jun-2024
    • (2023) Robust weak supervision with variational auto-encoders. Proceedings of the 40th International Conference on Machine Learning, 34394-34408. DOI: 10.5555/3618408.3619840. Online publication date: 23-Jul-2023
    • (2023) BTSAMA. International Journal of Ambient Computing and Intelligence 14:1, 1-23. DOI: 10.4018/IJACI.327351. Online publication date: 31-Jul-2023
    • (2023) Rotational Invariance Using Gabor Convolution Neural Network and Color Space for Image Processing. International Journal of Ambient Computing and Intelligence 14:1, 1-11. DOI: 10.4018/IJACI.323798. Online publication date: 23-May-2023
    • (2023) Traffic Light System With Embedded GPS (Global Positioning System) and GSM (Global System for Mobile Communications) Shield. International Journal of Ambient Computing and Intelligence 14:1, 1-13. DOI: 10.4018/IJACI.323196. Online publication date: 12-May-2023
    • (2023) Cyber Security Using Machine Learning Techniques. Proceedings of the International Conference on Applications of Machine Intelligence and Data Analytics (ICAMIDA 2022), 680-701. DOI: 10.2991/978-94-6463-136-4_59. Online publication date: 1-May-2023
    • (2023) An enhanced random forest approach using CoClust clustering: MIMIC-III and SMS spam collection application. Journal of Big Data 10:1. DOI: 10.1186/s40537-023-00720-9. Online publication date: 30-Mar-2023
