Latent Dirichlet Allocation in Web Spam Filtering

István Bíró; Jácint Szabó; András A. Benczúr


Proceedings of the 4th international workshop on Adversarial information retrieval on the web - AIRWeb '08, 2008

István Bíró, Jácint Szabó, András A. Benczúr
Data Mining and Web search Research Group, Informatics Laboratory
Computer and Automation Research Institute of the Hungarian Academy of Sciences
{ibiro, jacint, benczur}@ilab.sztaki.hu

* This work was supported by the EU FP7 project LiWA – Living Web Archives and by grants OTKA NK 72845, ASTOR NKFP 2/004/05.
AIRWeb '08, April 22, 2008, Beijing, China.

ABSTRACT

Latent Dirichlet allocation (LDA) (Blei, Ng, Jordan 2003) is a fully generative statistical language model on the content and topics of a corpus of documents. In this paper we apply a modification of LDA, the novel multi-corpus LDA technique, for web spam classification. We create a bag-of-words document for every Web site and run LDA both on the corpus of sites labeled as spam and as non-spam. In this way collections of spam and non-spam topics are created in the training phase. In the test phase we take the union of these collections, and an unseen site is deemed spam if its total spam topic probability is above a threshold. As far as we know, this is the first web retrieval application of LDA. We test this method on the UK2007-WEBSPAM corpus, and reach a relative improvement of 11% in F-measure by a logistic regression based combination with strong link and content baseline classifiers.

Categories and Subject Descriptors
H.3 [Information Systems]: Information Storage and Retrieval; I.2.7 [Computing Methodologies]: Artificial Intelligence - Natural Language Processing

General Terms
Text Analysis, Feature Selection, Document Classification, Information Retrieval

Keywords
Web content spam, latent Dirichlet allocation

1. INTRODUCTION

Identifying and preventing spam is cited as one of the top challenges in web search engines in [14, 20]. As all major search engines incorporate anchor text and link analysis algorithms into their ranking schemes, web spam appears in sophisticated forms that manipulate content as well as linkage [12].

In this paper we demonstrate the applicability of topic-based natural language models for Web spam filtering. Several such models have been developed in the field of information retrieval. Hofmann [15] introduced probabilistic latent semantic indexing (PLSI), which is a generative, graphical model enhancing latent semantic analysis by a sounder probabilistic model. Although PLSI had promising results, it suffers from two limitations: the number of parameters is linear in the number of documents, and it is not possible to make inference for unseen data.

These issues are addressed by latent Dirichlet allocation, developed by Blei, Ng and Jordan [4]. LDA is a fully generative graphical model for describing the latent topics of documents. LDA models every topic as a distribution over the words of the vocabulary, and every document as a distribution over the topics. These distributions are sampled from Dirichlet distributions. Each word of a document is drawn from the word distribution of a topic, and that topic is itself drawn for this word position from the topic distribution of the document. Several methods have been developed for making inference in LDA, such as variational expectation maximization [4], expectation propagation [17], and Gibbs sampling [11]. LDA is an intensively studied model, and the experimental results reported for it are impressive compared to other known information retrieval techniques.

LDA has several applications, including entity resolution [3], fraud detection in telecommunication systems [24], and image processing [8, 22], in addition to the large number of applications in the field of text retrieval. To the best of our knowledge, our experiments provide the first application of LDA in web spam filtering, and even in Web retrieval.

In this paper we introduce and apply a slight modification of LDA, called multi-corpus LDA, as follows. Assume we have a text classification task with m classes.
We run LDA separately for each class of the training set, then take the union of the resulting topic collections and make inference w.r.t. this aggregated collection of topics for every unseen document d. The total probability of class i topics in the topic distribution of d may serve as a measure of the extent to which d belongs to class i. For a more detailed description, see Subsection 2.1.
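The class measure described above can be illustrated with a minimal sketch. It assumes that inference has already produced a topic distribution for one document over the union of all per-class topics; the function name `class_scores`, the example numbers, and the threshold value are hypothetical and are not taken from the paper.

```python
def class_scores(theta, topic_class, num_classes):
    """Sum the topic probabilities of one document per class.

    theta       : topic distribution of the document over the union of
                  all per-class topic collections (entries sum to 1).
    topic_class : topic_class[z] is the index of the class whose LDA run
                  produced topic z.
    """
    scores = [0.0] * num_classes
    for prob, cls in zip(theta, topic_class):
        scores[cls] += prob
    return scores


# Hypothetical example: 2 topics from class 0 (spam), 3 from class 1 (non-spam).
theta = [0.10, 0.25, 0.30, 0.20, 0.15]
topic_class = [0, 0, 1, 1, 1]
scores = class_scores(theta, topic_class, num_classes=2)

# Illustrative decision rule with an assumed threshold of 0.3.
is_spam = scores[0] > 0.3
```

Because the topic distribution sums to one, the class scores also sum to one, so with two classes a single score is enough to characterise the document.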

In our experiments we run multi-corpus LDA with m = 2 classes: spam and non-spam. The inference is performed using Gibbs sampling. The total probability of spam topics in the topic distribution of an unseen document gives an LDA prediction of whether the document is spam or honest.

1.1 Related results

Spam hunters use a variety of content-based features [5, 9, 18, 10] to detect web spam; a recent measurement of their combination appears in [6]. Perhaps the strongest SVM-based content classification is described in [1]. An efficient method for combining several classifiers is the use of logistic regression, as shown by Lynam and Cormack [16].

Closest to our methods are the content-based email spam detection methods applied to Web spam presented at the Web Spam Challenge 2007 [7]. They use the method of [5] that compresses spam and nonspam separately; features are defined based on how well the document in question compresses with spam and nonspam, respectively. Our method is similar in the sense that we also build separate spam and nonspam content models.

1.2 Data set, evaluation, experimental setup

We test the multi-corpus LDA method in combination with the Web Spam Challenge 2008 public features¹, SVM over pivoted tf.idf [21], and the connectivity sonar features (analysis of the breadth-first and directory levels within a host together with the internal and external linkage) of [2]. Using logistic regression to aggregate these classifiers, the multi-corpus LDA method yields a relative improvement of 11% in F-measure and 1.5% in ROC. For classification we used the machine learning toolkit Weka [23]. For a detailed explanation, see Section 3.

2. METHOD

First we describe latent Dirichlet allocation [4]. For a detailed elaboration, we refer to Heinrich [13]. We have a vocabulary $V$ consisting of words, a set $T$ of $k$ topics and $n$ documents of arbitrary length. For every topic $z$ a distribution $\varphi_z$ on $V$ is sampled from $\mathrm{Dir}(\beta)$, where $\beta \in \mathbb{R}_+^V$ is a smoothing parameter.
Similarly, for every document $d$ a distribution $\vartheta_d$ on $T$ is sampled from $\mathrm{Dir}(\alpha)$, where $\alpha \in \mathbb{R}_+^T$ is a smoothing parameter. The words of the documents are drawn as follows: for every word position of document $d$ a topic $z$ is drawn from $\vartheta_d$, and then a word is drawn from $\varphi_z$ and filled into the position. LDA can be thought of as a Bayesian network, see Figure 1.

Figure 1: LDA as a Bayesian network. [Diagram: $\beta \to \varphi \sim \mathrm{Dir}(\beta)$ (plate of size $k$); $\alpha \to \vartheta \sim \mathrm{Dir}(\alpha)$ (plate of size $n$); $z \sim \vartheta$; $w \sim \varphi_z$.]

One method for making inference for LDA is Gibbs sampling [11]. Gibbs sampling is a Markov chain Monte Carlo algorithm for sampling from a joint distribution $p(x)$, $x \in \mathbb{R}^n$, if all conditional distributions $p(x_i \mid x_{-i})$ are known, where $x_{-i} = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$. The $k$th transition $x^{(k)} \to x^{(k+1)}$ of the Markov chain is generated as follows. Choose an index $1 \le i \le n$ (usually $i = k \bmod n$), and let $x^{(k+1)} = x^{(k)}$ everywhere except at index $i$, where $x_i^{(k+1)}$ is sampled from $p(x_i \mid x_{-i}^{(k)})$.

¹ Downloaded from http://www.yr-bcn.es/webspam/datasets/uk2007/features/ on March 31, 2008.

In LDA the goal is to estimate the distribution $p(z \mid w)$ for $z \in T^P$, $w \in V^P$, where $P$ denotes the set of word positions in the documents. Thus for Gibbs sampling one has to calculate $p(z_i \mid z_{-i}, w)$ for $i \in P$. This has an efficiently computable closed form (for a deduction, see [13]):

$$p(z_i \mid z_{-i}, w) = \frac{n_{z_i}^{t_i} - 1 + \beta_{t_i}}{n_{z_i} - 1 + \sum_t \beta_t} \cdot \frac{n_d^{z_i} - 1 + \alpha_{z_i}}{n_d - 1 + \sum_z \alpha_z}. \tag{1}$$

Here $d$ is the document of position $i$, $t_i$ is the actual word in position $i$, $n_{z_i}^{t_i}$ is the number of positions with topic $z_i$ and word $t_i$, $n_{z_i}$ is the number of positions with topic $z_i$, $n_d^{z_i}$ is the number of positions in document $d$ with topic $z_i$, and $n_d$ is the length of document $d$. After a sufficient number of iterations we arrive at a topic assignment sample $z$. Knowing $z$, we can estimate $\varphi$ and $\vartheta$ as

$$\varphi_{z,t} = \frac{n_z^t + \beta_t}{n_z + \sum_t \beta_t} \tag{2}$$

and

$$\vartheta_{d,z} = \frac{n_d^z + \alpha_z}{n_d + \sum_z \alpha_z}. \tag{3}$$

For an unseen document $d$ the topic distribution $\vartheta_d$ can be estimated exactly as in (3) once we have a sample from its word-topic assignment $z$. Sampling $z$ can be performed with a similar method as before, but now only for the positions $i$ in $d$:

$$p(z_i \mid z_{-i}, w) = \frac{\tilde{n}_{z_i}^{t_i} - 1 + \beta_{t_i}}{\tilde{n}_{z_i} - 1 + \sum_t \beta_t} \cdot \frac{n_d^{z_i} - 1 + \alpha_{z_i}}{n_d - 1 + \sum_z \alpha_z}. \tag{4}$$

The notation $\tilde{n}$ refers to the union of the whole corpus and document $d$. In the next subsection we will make use of the observation that the first factor in product (4) is approximately equal to $\varphi_{z_i,t_i}$ by (2).
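As an illustration of how the sampling rule (1) and the estimates (2) and (3) fit together, here is a minimal collapsed Gibbs sampler in plain Python. It is a sketch under stated assumptions, not the GibbsLDA++ implementation: it assumes symmetric scalar values for $\alpha$ and $\beta$, represents documents as lists of integer word ids, and the function name `gibbs_lda` and the toy corpus are hypothetical. The "- 1" terms of (1) are realised by removing the current position from the counts before sampling, and the second denominator of (1) is dropped because it does not depend on the topic.

```python
import random


def gibbs_lda(docs, vocab_size, k, alpha, beta, iterations, seed=0):
    """Collapsed Gibbs sampling for LDA with symmetric scalar alpha, beta.

    docs: list of documents, each a list of word ids in range(vocab_size).
    Returns (phi, theta) estimated as in equations (2) and (3).
    """
    rng = random.Random(seed)
    n_zt = [[0] * vocab_size for _ in range(k)]  # n_z^t: topic-word counts
    n_z = [0] * k                                # n_z: positions per topic
    n_dz = [[0] * k for _ in docs]               # n_d^z: document-topic counts
    z = []                                       # topic of every position

    # Random initial topic assignment for every word position.
    for d, doc in enumerate(docs):
        z_d = []
        for t in doc:
            topic = rng.randrange(k)
            z_d.append(topic)
            n_zt[topic][t] += 1
            n_z[topic] += 1
            n_dz[d][topic] += 1
        z.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, t in enumerate(doc):
                # Take position i out of the counts (the "- 1" in (1)).
                old = z[d][i]
                n_zt[old][t] -= 1
                n_z[old] -= 1
                n_dz[d][old] -= 1
                # Unnormalised conditional weight of each topic.
                weights = [
                    (n_zt[j][t] + beta) / (n_z[j] + vocab_size * beta)
                    * (n_dz[d][j] + alpha)
                    for j in range(k)
                ]
                new = rng.choices(range(k), weights=weights)[0]
                # Put position i back with its newly sampled topic.
                z[d][i] = new
                n_zt[new][t] += 1
                n_z[new] += 1
                n_dz[d][new] += 1

    # Point estimates of equations (2) and (3).
    phi = [[(n_zt[j][t] + beta) / (n_z[j] + vocab_size * beta)
            for t in range(vocab_size)] for j in range(k)]
    theta = [[(n_dz[d][j] + alpha) / (len(doc) + k * alpha)
              for j in range(k)] for d, doc in enumerate(docs)]
    return phi, theta


# Toy corpus (assumed for illustration): 4 documents over a 5-word vocabulary.
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 2], [0, 1, 1, 0], [4, 3, 3, 4]]
phi, theta = gibbs_lda(docs, vocab_size=5, k=2, alpha=0.5, beta=0.1,
                       iterations=200)
```

Each row of `phi` is a word distribution of one topic and each row of `theta` is a topic distribution of one document, so every row sums to one. For an unseen document, the same loop can be run over that document's positions only, with the first factor of the weight replaced by a fixed `phi[j][t]`, which is the approximation noted at the end of this section.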

2.1 Multi-corpus LDA

As outlined in the introduction, in the multi-corpus setting we run two distinct LDAs: one on the collection of labeled spam sites with $k^{(s)}$ topics, called spam topics, and one on the collection of labeled non-spam sites with $k^{(n)}$ topics, called non-spam topics. The vocabulary is the same for both LDAs. After both inferences have been done, we have word distributions for all $k^{(s)} + k^{(n)}$ topics. From now on we think of the obtained word distributions of the unified collection of spam and non-spam topics as if they were estimated from only one presumed corpus.

To make inference for an unseen document $d$, we perform Gibbs sampling on this presumed unique distribution using (4). Observe that the $\tilde{n}$ terms in the first factor of the product are not known, since the topic assignments of the presumed corpus are not known either. Thus we approximate this first factor by $\varphi_{z_i,t_i}$, and $p(z_i \mid z_{-i}, w)$ by

$$p(z_i \mid z_{-i}, w) \approx \varphi_{z_i,t_i} \cdot \frac{n_d^{z_i} - 1 + \alpha_{z_i}}{n_d - 1 + \sum_z \alpha_z}, \tag{5}$$

which is a closed form expression that can be computed in $O(P \cdot k)$ steps, where $P$ is the number of word occurrences in the corpus and $k$ is the number of topics. To distinguish this method from the original Gibbs sampling inference developed in [11], we call it the multi-corpus inference. This is applied only to unseen documents.

After a sufficient number of iterations we calculate $\vartheta_d$ as in (3), and define the LDA prediction to be $\sum \{\vartheta_{d,z} : z \text{ is a spam topic}\}$. As a simplest solution we may classify $d$ as spam if its LDA prediction is above a certain threshold.

3. EXPERIMENTS

The data set we used is the UK2007-WEBSPAM corpus. We kept only the labeled sites, with 203 labeled as spam and 3589 as non-spam. We aggregated the words and meta keywords appearing in all pages of the sites to form one document per site in a bag of words model (only multiplicity and no order information used). We kept only alphanumeric characters and the hyphen, but removed all words containing a hyphen not between two alphabetical words. We deleted all stop words enumerated in the list of http://www.lextek.com/manuals/onix/stopwords1.html, and used the TreeTagger (http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/) for stemming. After this procedure the most frequent 22,000 words formed the vocabulary.

We used Phan's GibbsLDA++ C++ code [19] to run LDA and a modified version of it to run the multi-corpus inference for unseen documents in multi-corpus LDA. We applied 5-fold cross validation. Two LDAs were run on the training spam and non-spam corpora, and then multi-corpus inference was made for the test documents by Gibbs sampling as in (5). The Dirichlet parameter $\beta$ was chosen to be constant 0.1 throughout, while $\alpha^{(s)} = 50/k^{(s)}$, $\alpha^{(n)} = 50/k^{(n)}$, and during multi-corpus inference $\alpha$ was constant $50/(k^{(s)} + k^{(n)})$ (these are the default values in GibbsLDA++). We stopped Gibbs sampling after 2000 steps for inference on the training data, and after 1000 steps for the multi-corpus inference for an unseen document.

The numbers of topics were chosen to be $k^{(s)} = 2, 5, 10, 20$ and $k^{(n)} = 10, 20, 50$. Consequently, we performed altogether 12 runs, and thus obtained 12 one-dimensional LDA predictions as features. By observing the F-measure curves as shown in Figure 2 we selected the three best performing parameter choices. F-measures and ROC values are shown in Table 1. The best result corresponds to the choice $k^{(s)} = 10$ and $k^{(n)} = 50$ with an F-measure of 0.46.

Figure 2: F-measure curves with varying thresholds (horizontal axis), and precision-recall curves of the three best LDA features. [Two plots: F-Measure against Threshold, and Recall against Precision; curves for spam:5,nonspam:50, spam:10,nonspam:50 and spam:5,nonspam:10.]

Figure 2 indicates that the multi-corpus method is robust to the choice of topic numbers, as the performance does not really change when the topic numbers are changed. As one can expect, the maximum of such an F-measure curve lies at approximately $k^{(s)}/k^{(n)}$.

pair of topic numbers    F        ROC
5/50                     0.451    0.855
10/50                    0.458    0.861
5/10                     0.458    0.868

Table 1: F-measures and ROC values of the three best performing LDA predictions.

Using logistic regression, we combined the single best performing $k^{(s)} = 10$, $k^{(n)} = 50$ LDA prediction with an SVM classifier over the tf.idf features and a C4.5 classifier over the public and the sonar features. All methods were performed by the machine learning toolkit Weka [23]. The F-measures and ROC values are shown in Table 2. Adding LDA gave a relative improvement of 11% in F-measure and 1.5% in ROC over the remaining combined features.

We also performed single-feature classification for the three best performing LDA predictions over the UK2006-WEBSPAM corpus. Here we have 2125 sites labeled as spam, and 8082
Here we have 2125 sites labeled as spam, and 8082 feature set text (SVM) public & text & sonar (log) public & text & sonar & lda (log) F 0.554 0.601 0.667 ROC 0.864 0.954 0.969 Table 2: F/ROC values pair of topic numbers 5/50 10/50 5/10 F 0.704 0.735 0.723 ROC 0.881 0.902 0.906 [8] [9] [10] Table 3: F-measures and ROC values for UK2006WEBSPAM [11] labeled as non-spam. The parameters and the setup was the same as above. The results can be seen at Table 3. [12] Conclusion and future work We presented a novel multi-corpus LDA technique that resulted in a relative improvement of about 10% over a strong content and link feature baseline. Although apparently the UK2007-WEBSPAM data is much more sensitive to content than to link features, we reached improvement over the UK2006-WEBSPAM data as well. We believe that similar to the success of email spam filtering methods [7] semantic analysis to spam filtering is a promising direction. In future work we plan to implement the email content filtering methods of [7] and test its combination with LDA. 4. [13] [14] [15] [16] REFERENCES [1] J. Abernethy, O. Chapelle, and C. Castillo. WITCH: A New Approach to Web Spam Detection. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2008. [2] E. Amitay, D. Carmel, A. Darlow, R. Lempel, and A. Soffer. The Connectivity Sonar: Detecting site functionality by structural patterns. In Proceedings of the 14th ACM Conference on Hypertext and Hypermedia (HT), pages 38–47, Nottingham, United Kingdom, 2003. [3] I. Bhattacharya and L. Getoor. A latent dirichlet model for unsupervised entity resolution. SIAM International Conference on Data Mining, 2006. [4] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(5):993–1022, 2003. [5] A. Bratko, B. Filipič, G. Cormack, T. Lynam, and B. Zupan. Spam Filtering Using Statistical Data Compression Models. 
The Journal of Machine Learning Research, 7:2673–2698, 2006. [6] C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: Web spam detection using the web topology. Technical report, DELIS – Dynamically Evolving, Large-Scale Information Systems, 2006. [7] G. Cormack. Content-based Web Spam Detection. In Proceedings of the 3rd International Workshop on [17] [18] [19] [20] [21] [22] [23] [24] Adversarial Information Retrieval on the Web (AIRWeb), 2007. L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. Proc. CVPR, 5, 2005. D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics – Using statistical analysis to locate spam web pages. In Proceedings of the 7th International Workshop on the Web and Databases (WebDB), pages 1–6, Paris, France, 2004. D. Fetterly, M. Manasse, and M. Najork. Detecting phrase-level duplication on the world wide web. In Proceedings of the 28th ACM International Conference on Research and Development in Information Retrieval (SIGIR), Salvador, Brazil, 2005. T. Griffiths. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235, 2004. Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Chiba, Japan, 2005. G. Heinrich. Parameter estimation for text analysis. Technical report, Technical Report, 2004. M. R. Henzinger, R. Motwani, and C. Silverstein. Challenges in web search engines. SIGIR Forum, 36(2):11–22, 2002. T. Hofmann. Probabilistic latent semantic indexing. Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 50–57, 1999. T. Lynam, G. Cormack, and D. Cheriton. On-line spam filter fusion. Proc. of the 29th international ACM SIGIR conference on Research and development in information retrieval, pages 123–130, 2006. T. Minka and J. 
Lafferty. Expectation-propagation for the generative aspect model. Uncertainty in Artificial Intelligence (UAI), 2002. A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In Proceedings of the 15th International World Wide Web Conference (WWW), pages 83–92, Edinburgh, Scotland, 2006. X.-H. Phan. http://gibbslda.sourceforge.net/. A. Singhal. Challenges in running a commercial search engine. In IBM Search and Collaboration Seminar 2004. IBM Haifa Labs, 2004. A. Singhal, G. Salton, M. Mitra, and C. Buckley. Document length normalization. Information Processing and Management, 32(5):619–633, 1996. J. Sivic, B. Russell, A. Efros, A. Zisserman, and W. Freeman. Discovering Objects and their Localization in Images. Computer Vision, ICCV 2005. Tenth IEEE International Conference on, 1, 2005. I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, second edition, June 2005. D. Xing and M. Girolami. Employing Latent Dirichlet Allocation for fraud detection in telecommunications. Pattern Recognition Letters, 28(13):1727–1734, 2007.
