DOI: 10.1145/1531914.1531922
research-article

Linked latent Dirichlet allocation in web spam filtering

Published: 21 April 2009

    Abstract

    Latent Dirichlet allocation (LDA) (Blei, Ng, Jordan 2003) is a fully generative statistical language model of the content and topics of a corpus of documents. In this paper we apply an extension of LDA to web spam classification. Our linked LDA technique also takes linkage into account: topics are propagated along links in such a way that the linked document directly influences the words in the linking document. As with latent semantic indexing, the inferred LDA model can be used for classification as a form of dimensionality reduction. We test linked LDA on the WEBSPAM-UK2007 corpus. Using a BayesNet classifier, we achieve, in terms of classification AUC, a 3% improvement over plain LDA with BayesNet and an 8% improvement over the public link features with C4.5. Adding this method to a log-odds based combination of strong link and content baseline classifiers yields a 3% improvement in AUC. Our method even slightly improves on the best Web Spam Challenge 2008 result.
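    The abstract does not spell out how the log-odds based combination is formed, so the following is only a minimal sketch under stated assumptions: each classifier is assumed to output a spam probability per host, the combination is assumed to be a plain average of log-odds, and AUC is computed as the fraction of (spam, non-spam) pairs ranked correctly. The function names and the toy scores are hypothetical and are not taken from the paper.

```python
import math

def log_odds(p, eps=1e-6):
    """Map a probability to log-odds, clipping to avoid infinities."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def combine_log_odds(score_lists):
    """Average the log-odds of several classifiers' scores, host by host.

    Assumption: an unweighted average; the paper may weight classifiers differently.
    """
    n = len(score_lists)
    return [sum(log_odds(p) for p in scores) / n for scores in zip(*score_lists)]

def auc(labels, scores):
    """AUC as the share of (spam, non-spam) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 1 = spam host, 0 = non-spam host.
labels = [1, 1, 0, 0, 1, 0]
content_scores = [0.9, 0.4, 0.3, 0.6, 0.7, 0.2]  # e.g. a content-based classifier
link_scores = [0.6, 0.8, 0.5, 0.2, 0.4, 0.1]     # e.g. a link-based classifier

combined = combine_log_odds([content_scores, link_scores])
print(auc(labels, content_scores), auc(labels, link_scores), auc(labels, combined))
```

    On this toy data each classifier misranks one spam/non-spam pair, and the two mistakes fall on different hosts, so the averaged log-odds ranks every spam host above every non-spam host. This illustrates why combining complementary content and link classifiers can raise AUC; it says nothing about the size of the gains reported in the paper.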

    References

    [1] J. Abernethy, O. Chapelle, and C. Castillo. WITCH: A New Approach to Web Spam Detection. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2008.
    [2] I. Bíró, J. Szabó, and A. A. Benczúr. Latent Dirichlet Allocation in Web Spam Filtering. Manuscript, 2008.
    [3] I. Bíró, J. Szabó, and A. A. Benczúr. Very Large Scale Link Based Latent Dirichlet Allocation for Web Document Classification. Manuscript, http://www.ilab.sztaki.hu/~ibiro/linkedLDA/, 2009.
    [4] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(5):993--1022, 2003.
    [5] A. Bratko, B. Filipič, G. Cormack, T. Lynam, and B. Zupan. Spam Filtering Using Statistical Data Compression Models. Journal of Machine Learning Research, 7:2673--2698, 2006.
    [6] C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: web spam detection using the web topology. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 423--430, 2007.
    [7] D. Cohn and T. Hofmann. The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity. Advances in Neural Information Processing Systems, pages 430--436, 2001.
    [8] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391--407, 1990.
    [9] L. Dietz, S. Bickel, and T. Scheffer. Unsupervised prediction of citation influences. In Proceedings of the 24th International Conference on Machine Learning, pages 233--240. ACM Press, New York, NY, USA, 2007.
    [10] E. Erosheva, S. Fienberg, and J. Lafferty. Mixed-membership models of scientific publications, 2004.
    [11] D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics -- Using statistical analysis to locate spam web pages. In Proceedings of the 7th International Workshop on the Web and Databases (WebDB), pages 1--6, Paris, France, 2004.
    [12] D. Fetterly, M. Manasse, and M. Najork. Detecting phrase-level duplication on the world wide web. In Proceedings of the 28th ACM International Conference on Research and Development in Information Retrieval (SIGIR), Salvador, Brazil, 2005.
    [13] T. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl_1):5228--5235, 2004.
    [14] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Chiba, Japan, 2005.
    [15] G. Heinrich. Parameter estimation for text analysis. Technical report, 2004.
    [16] M. R. Henzinger, R. Motwani, and C. Silverstein. Challenges in web search engines. SIGIR Forum, 36(2):11--22, 2002.
    [17] T. Hofmann. Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning, 42(1):177--196, 2001.
    [18] Z. Kou and W. W. Cohen. Stacked graphical models for efficient inference in Markov random fields. In SDM 07, 2007.
    [19] T. Lynam, G. Cormack, and D. Cheriton. On-line spam filter fusion. In Proceedings of the 29th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 123--130, 2006.
    [20] R. Nallapati, A. Ahmed, E. Xing, and W. Cohen. Joint Latent Topic Models for Text and Citations. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, New York, NY, USA, 2008.
    [21] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In Proceedings of the 15th International World Wide Web Conference (WWW), pages 83--92, Edinburgh, Scotland, 2006.
    [22] A. Singhal. Challenges in running a commercial search engine. In IBM Search and Collaboration Seminar 2004. IBM Haifa Labs, 2004.
    [23] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, second edition, June 2005.
    [24] X. Zhu, J. Kandola, Z. Ghahramani, and J. Lafferty. Nonparametric transforms of graph kernels for semi-supervised learning. Advances in Neural Information Processing Systems, 17:1641--1648, 2005.

    Published In

    AIRWeb '09: Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
    April 2009
    67 pages
    ISBN: 9781605584386
    DOI: 10.1145/1531914
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. document classification
    2. feature selection
    3. information retrieval
    4. latent Dirichlet allocation
    5. text analysis
    6. web content spam

    Qualifiers

    • Research-article

    Conference

    AIRWeb '09

    Cited By

    • (2024) A Review: Comprehensive study on societal Analysis for health care system Using topic modeling Paradigms. 2024 International Conference on Advancements in Smart, Secure and Intelligent Computing (ASSIC), pages 1-5. DOI: 10.1109/ASSIC60049.2024.10507910. Online publication date: 27-Jan-2024
    • (2023) It is about inclusion! Mining online reviews to understand the needs of adaptive clothing customers. International Journal of Consumer Studies, 47:3, pages 1157-1172. DOI: 10.1111/ijcs.12895. Online publication date: 18-Jan-2023
    • (2023) Incorporating Word Embedding and Hybrid Model Random Forest Softmax Regression for Predicting News Categories. Multimedia Tools and Applications. DOI: 10.1007/s11042-023-16491-7. Online publication date: 15-Sep-2023
    • (2023) Enhancing Readability in Custom Templates for Displaying Semantically Marked Information. Advances in Computer Science for Engineering and Education VI, pages 137-146. DOI: 10.1007/978-3-031-36118-0_13. Online publication date: 19-Aug-2023
    • (2021) Fesztivállátogatók véleményeinek számítógéppel támogatott tematikus modellezése – egy kísérlet eredményei = Computer-aided topic modelling based on festival-goers’ opinions – results of an experiment. Turizmus Bulletin, 21:1, pages 4-12. DOI: 10.14267/TURBULL.2021v21n1.1. Online publication date: 21-Apr-2021
    • (2020) Detecting Web Spam Based on Novel Features from Web Page Source Code. Security and Communication Networks, 2020. DOI: 10.1155/2020/6662166. Online publication date: 17-Dec-2020
    • (2019) Identifying Latent Semantics in Action Games for Player Modeling. International Journal of Gaming and Computer-Mediated Simulations, 11:2, pages 1-21. DOI: 10.4018/IJGCMS.2019040101. Online publication date: Apr-2019
    • (2019) Practical Web Spam Lifelong Machine Learning System with Automatic Adjustment to Current Lifecycle Phase. Security and Communication Networks, 2019. DOI: 10.1155/2019/6587020. Online publication date: 20-Feb-2019
    • (2019) Filtering spam messages and mails using fuzzy C means algorithm. 2019 4th International Conference on Internet of Things: Smart Innovation and Usages (IoT-SIU), pages 1-5. DOI: 10.1109/IoT-SIU.2019.8777483. Online publication date: Apr-2019
    • (2018) Unveiling Topics from Scientific Literature on the Subject of Self-driving Cars using Latent Dirichlet Allocation. 2018 IEEE 9th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), pages 1113-1119. DOI: 10.1109/IEMCON.2018.8615056. Online publication date: Nov-2018
