
Latent Dirichlet Allocation in Web Spam Filtering

Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb '08), 2008
Latent Dirichlet Allocation in Web Spam Filtering*

István Bíró, Jácint Szabó, András A. Benczúr
Data Mining and Web Search Research Group, Informatics Laboratory
Computer and Automation Research Institute of the Hungarian Academy of Sciences
{ibiro, jacint, benczur}@ilab.sztaki.hu

ABSTRACT

Latent Dirichlet allocation (LDA) (Blei, Ng, Jordan 2003) is a fully generative statistical language model on the content and topics of a corpus of documents. In this paper we apply a modification of LDA, the novel multi-corpus LDA technique, to web spam classification. We create a bag-of-words document for every Web site and run LDA both on the corpus of sites labeled as spam and on the corpus of sites labeled as non-spam. In this way collections of spam and non-spam topics are created in the training phase. In the test phase we take the union of these collections, and an unseen site is deemed spam if its total spam topic probability is above a threshold. As far as we know, this is the first web retrieval application of LDA. We test this method on the UK2007-WEBSPAM corpus and reach a relative improvement of 11% in F-measure by a logistic regression based combination with strong link and content baseline classifiers.

Categories and Subject Descriptors: H.3 [Information Systems]: Information Storage and Retrieval; I.2.7 [Computing Methodologies]: Artificial Intelligence—Natural Language Processing

General Terms: Text Analysis, Feature Selection, Document Classification, Information Retrieval

Keywords: Web content spam, latent Dirichlet allocation

* This work was supported by the EU FP7 project LiWA – Living Web Archives and by grants OTKA NK 72845, ASTOR NKFP 2/004/05. AIRWeb '08, April 22, 2008, Beijing, China.

1. INTRODUCTION

Identifying and preventing spam is cited as one of the top challenges in web search engines in [14, 20]. As all major search engines incorporate anchor text and link analysis algorithms into their ranking schemes, web spam appears in sophisticated forms that manipulate content as well as linkage [12].

In this paper we demonstrate the applicability of topic based natural language models to Web spam filtering. Several such models have been developed in the field of information retrieval. Hofmann [15] introduced probabilistic latent semantic indexing (PLSI), a generative graphical model that enhances latent semantic analysis with a sounder probabilistic foundation. Although PLSI had promising results, it suffers from two limitations: the number of parameters is linear in the number of documents, and it is not possible to make inference for unseen data.

These issues are addressed by latent Dirichlet allocation, developed by Blei, Ng and Jordan [4]. LDA is a fully generative graphical model for describing the latent topics of documents. LDA models every topic as a distribution over the words of the vocabulary, and every document as a distribution over the topics. These distributions are sampled from Dirichlet distributions. Each word of a document is drawn from the word distribution of the topic that was just drawn for that word position from the topic distribution of the document. Several methods have been developed for making inference in LDA, such as variational expectation maximization [4], expectation propagation [17], and Gibbs sampling [11]. LDA is an intensively studied model, and its results are impressive compared to other known information retrieval techniques.
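The generative process just described can be made concrete with a short sketch. The following is only an illustration (not the authors' code); the vocabulary size, topic count, document count and hyperparameter values are placeholders.

    import numpy as np

    rng = np.random.default_rng(0)

    V, k, n_docs = 1000, 20, 5           # vocabulary size, topics, documents (placeholders)
    alpha = np.full(k, 0.1)              # Dirichlet prior on document-topic distributions
    beta = np.full(V, 0.01)              # Dirichlet prior on topic-word distributions

    # One word distribution phi_z ~ Dir(beta) per topic.
    phi = rng.dirichlet(beta, size=k)    # shape (k, V)

    docs = []
    for _ in range(n_docs):
        theta = rng.dirichlet(alpha)     # document's topic distribution, theta_d ~ Dir(alpha)
        length = rng.integers(50, 200)   # arbitrary document length
        words = []
        for _ in range(length):
            z = rng.choice(k, p=theta)   # draw a topic for this word position
            w = rng.choice(V, p=phi[z])  # draw a word from that topic's word distribution
            words.append(w)
        docs.append(words)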
LDA has several applications, including entity resolution [3], fraud detection in telecommunication systems [24], and image processing [8, 22], in addition to the large number of applications in the field of text retrieval. To the best of our knowledge, our experiments provide the first application of LDA to web spam filtering, and indeed to Web retrieval.

In this paper we introduce and apply a slight modification of LDA, called multi-corpus LDA, as follows. Assume we have a text classification task with m classes. We run LDA separately for each class of the training set, then take the union of the resulting topic collections and make inference with respect to this aggregated collection of topics for every unseen document d. The total probability of class i topics in the topic distribution of d may serve as a measure of the extent to which d belongs to class i. For a more detailed description, see Subsection 2.1.
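Schematically, the m-class scheme can be summarised as in the sketch below. This is only an outline: train_lda and infer_topic_distribution stand in for any LDA trainer and inference routine, and the topic_word_dists attribute is a hypothetical name, not a specific library API; Section 2 gives the actual inference used here.

    def multi_corpus_predict(train_sets, unseen_doc, train_lda, infer_topic_distribution):
        """Sketch of m-class multi-corpus LDA scoring.

        train_sets: list of m corpora, one per class.
        Returns, for each class i, the total probability of class-i topics
        in the topic distribution of unseen_doc.
        """
        # Train one LDA per class and remember which topics belong to which class.
        topics, owner = [], []
        for i, corpus in enumerate(train_sets):
            model = train_lda(corpus)                 # class-specific topic collection
            topics.extend(model.topic_word_dists)     # word distributions of its topics
            owner.extend([i] * len(model.topic_word_dists))

        # Infer a topic distribution for the unseen document over the union of topics.
        theta = infer_topic_distribution(unseen_doc, topics)

        # Class score = total probability mass on that class's topics.
        m = len(train_sets)
        return [sum(t for t, o in zip(theta, owner) if o == i) for i in range(m)]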
In our experiments we run multi-corpus LDA with m = 2 classes: spam and non-spam. The inference is performed using Gibbs sampling. The total probability of spam topics in the topic distribution of an unseen document gives an LDA prediction of being spam or honest.

1.1 Related results

Spam hunters use a variety of content based features [5, 9, 18, 10] to detect web spam; a recent measurement of their combination appears in [6]. Perhaps the strongest SVM based content classification is described in [1]. An efficient method for combining several classifiers is the use of logistic regression, as shown by Lynam and Cormack [16].

Closest to our methods are the content based email spam detection methods applied to Web spam presented at the Web Spam Challenge 2007 [7]. They use the method of [5] that compresses spam and non-spam separately; features are defined based on how well the document in question compresses with spam and with non-spam, respectively. Our method is similar in the sense that we also build separate spam and non-spam content models.

1.2 Data set, evaluation, experimental setup

We test the multi-corpus LDA method in combination with the Web Spam Challenge 2008 public features (downloaded from http://www.yr-bcn.es/webspam/datasets/uk2007/features/ on March 31, 2008), SVM over pivoted tf.idf [21], and the connectivity sonar features (analysis of the breadth-first and directory levels within a host together with the internal and external linkage) of [2]. Using logistic regression to aggregate these classifiers, the multi-corpus LDA method yields an improvement of 11% in F-measure and 1.5% in ROC. For classification we used the machine learning toolkit Weka [23]. For a detailed explanation, see Section 3.

2. METHOD

First we describe latent Dirichlet allocation [4]. For a detailed elaboration, we refer to Heinrich [13]. We have a vocabulary V consisting of words, a set T of k topics and n documents of arbitrary length. For every topic z a distribution ϕ_z on V is sampled from Dir(β), where β ∈ R^V_+ is a smoothing parameter. Similarly, for every document d a distribution ϑ_d on T is sampled from Dir(α), where α ∈ R^T_+ is a smoothing parameter. The words of the documents are drawn as follows: for every word position of document d a topic z is drawn from ϑ_d, and then a word is drawn from ϕ_z and filled into the position. LDA can be thought of as a Bayesian network, see Figure 1.

[Figure 1: LDA as a Bayesian network. The plate diagram shows ϕ ∼ Dir(β) for each of the k topics, ϑ ∼ Dir(α) for each of the n documents, and z ∼ ϑ, w ∼ ϕ_z for every word position.]

One method for making inference for LDA is Gibbs sampling [11]. Gibbs sampling is a Markov chain Monte Carlo algorithm for sampling from a joint distribution p(x), x ∈ R^n, if all conditional distributions p(x_i | x_{-i}) are known, where x_{-i} = (x_1, ..., x_{i-1}, x_{i+1}, ..., x_n). The k-th transition x^{(k)} → x^{(k+1)} of the Markov chain is generated as follows. Choose an index 1 ≤ i ≤ n (usually i = k mod n), and let x^{(k+1)} = x^{(k)} everywhere except at index i, where x_i^{(k+1)} is sampled from p(x_i | x_{-i}^{(k)}).

In LDA the goal is to estimate the distribution p(z|w) for z ∈ T^P, w ∈ V^P, where P denotes the set of word positions in the documents. Thus for Gibbs sampling one has to calculate p(z_i | z_{-i}, w) for i ∈ P. This has an efficiently computable closed form (for a deduction, see [13]):

    p(z_i \mid z_{-i}, w) = \frac{n_{z_i}^{t_i} - 1 + \beta_{t_i}}{n_{z_i} - 1 + \sum_t \beta_t} \cdot \frac{n_d^{z_i} - 1 + \alpha_{z_i}}{n_d - 1 + \sum_z \alpha_z}.    (1)
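A standard collapsed Gibbs sweep built around update (1) might look like the sketch below. This is a simplified illustration, not the GibbsLDA++ implementation, and the count-array names are our own; removing the current position's counts before evaluating the conditional corresponds to the "−1" terms in (1), and the document-length denominator is dropped because it is constant over topics.

    import numpy as np

    def gibbs_pass(docs, z, n_wt, n_t, n_dt, alpha, beta, rng):
        """One Gibbs sweep over all word positions.

        docs[d] : list of word ids of document d
        z[d]    : list of current topic assignments, parallel to docs[d]
        n_wt    : (V, k) word-topic counts,  n_t: (k,) topic counts
        n_dt    : (D, k) document-topic counts
        alpha   : (k,) or scalar smoothing, beta: (V,) smoothing
        """
        k = n_t.shape[0]
        beta_sum = beta.sum()
        for d, words in enumerate(docs):
            for i, w in enumerate(words):
                old = z[d][i]
                # Remove the current assignment (the "-1" terms in (1)).
                n_wt[w, old] -= 1; n_t[old] -= 1; n_dt[d, old] -= 1
                # Conditional p(z_i | z_-i, w) up to normalisation, as in (1);
                # the (n_d - 1 + sum alpha) factor is constant in z and omitted.
                p = (n_wt[w, :] + beta[w]) / (n_t + beta_sum) * (n_dt[d, :] + alpha)
                p /= p.sum()
                new = rng.choice(k, p=p)
                z[d][i] = new
                n_wt[w, new] += 1; n_t[new] += 1; n_dt[d, new] += 1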
Here d is the document of position i, t_i is the actual word in position i, n^{t_i}_{z_i} is the number of positions with topic z_i and word t_i, n_{z_i} is the number of positions with topic z_i, n^{z_i}_d is the number of positions with topic z_i in document d, and n_d is the length of document d. After a sufficient number of iterations we arrive at a topic assignment sample z. Knowing z, we can estimate ϕ and ϑ as

    \varphi_{z,t} = \frac{n_z^t + \beta_t}{n_z + \sum_t \beta_t}    (2)

and

    \vartheta_{d,z} = \frac{n_d^z + \alpha_z}{n_d + \sum_z \alpha_z}.    (3)

For an unseen document d the ϑ topic distribution can be estimated exactly as in (3) once we have a sample from its word–topic assignment z. Sampling z can be performed with a similar method as before, but now only for the positions i in d:

    p(z_i \mid z_{-i}, w) = \frac{\tilde{n}_{z_i}^{t_i} - 1 + \beta_{t_i}}{\tilde{n}_{z_i} - 1 + \sum_t \beta_t} \cdot \frac{n_d^{z_i} - 1 + \alpha_{z_i}}{n_d - 1 + \sum_z \alpha_z}.    (4)

The notation \tilde{n} refers to counts taken over the union of the whole corpus and document d. In the next subsection we will make use of the observation that the first factor in the product (4) is approximately equal to ϕ_{z_i, t_i} by (2).

2.1 Multi-corpus LDA

As outlined in the introduction, in the multi-corpus setting we run two distinct LDAs: one on the collection of labeled spam sites with k^{(s)} topics, called spam topics, and one on the collection of labeled non-spam sites with k^{(n)} topics, called non-spam topics. The vocabulary is the same for both LDAs. After both inferences have been done, we have word distributions for all k^{(s)} + k^{(n)} topics. From now on we think of the obtained word distributions of the unified collection of spam and non-spam topics as if they were estimated from only one presumed corpus.
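Assuming the two LDA runs expose their estimated topic-word distributions as matrices obtained via (2), merging them into one presumed collection is a simple concatenation; the variable names below are illustrative only.

    import numpy as np

    # phi_spam:    (k_s, V) word distributions of the k_s spam topics, from (2)
    # phi_nonspam: (k_n, V) word distributions of the k_n non-spam topics
    def unify_topics(phi_spam, phi_nonspam):
        phi_all = np.vstack([phi_spam, phi_nonspam])       # (k_s + k_n, V)
        is_spam_topic = np.array([True] * phi_spam.shape[0]
                                 + [False] * phi_nonspam.shape[0])
        return phi_all, is_spam_topic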
To make inference for an unseen document d, we perform Gibbs sampling on this presumed unified topic collection using (4). Observe that the \tilde{n} terms in the first factor of the product are not known, as the topic assignments of the presumed corpus are also not known. Thus we approximate this first factor by ϕ_{z_i, t_i}, and approximate p(z_i | z_{-i}, w) by

    p(z_i \mid z_{-i}, w) \approx \varphi_{z_i, t_i} \cdot \frac{n_d^{z_i} - 1 + \alpha_{z_i}}{n_d - 1 + \sum_z \alpha_z},    (5)

which is a closed form expression that can be computed in O(P · k) steps, where P is the number of word occurrences in the corpus and k is the number of topics. To distinguish this method from the original Gibbs sampling inference developed in [11], we call it the multi-corpus inference. It is applied only to unseen documents.

After a sufficient number of iterations we calculate ϑ_d as in (3), and define the LDA prediction to be \sum \{ϑ_{d,z} : z is a spam topic\}.
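Under the approximation (5), inference for an unseen document reduces to Gibbs sampling with the fixed, unified ϕ. The sketch below is illustrative only; it reuses the hypothetical phi_all and is_spam_topic from the earlier sketch, estimates ϑ_d as in (3), and returns the total spam-topic probability.

    import numpy as np

    def lda_prediction(words, phi_all, is_spam_topic, alpha, n_iter=1000, rng=None):
        """Multi-corpus inference for one unseen document (list of word ids).

        phi_all: (K, V) unified topic-word distributions, K = k_s + k_n.
        alpha:   (K,) Dirichlet smoothing vector.
        Returns the total probability of spam topics in the document's
        topic distribution, i.e. the sum of theta_{d,z} over spam topics z.
        """
        rng = rng or np.random.default_rng()
        K = phi_all.shape[0]
        z = rng.integers(K, size=len(words))               # random initial assignment
        n_dt = np.bincount(z, minlength=K).astype(float)   # per-topic counts in d

        for _ in range(n_iter):
            for i, w in enumerate(words):
                n_dt[z[i]] -= 1                            # remove current position
                # Approximation (5): first factor replaced by phi_{z, w}.
                p = phi_all[:, w] * (n_dt + alpha)
                p /= p.sum()
                z[i] = rng.choice(K, p=p)
                n_dt[z[i]] += 1

        theta = (n_dt + alpha) / (len(words) + alpha.sum())   # as in (3)
        return theta[is_spam_topic].sum()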
As a simplest solution, we may classify d as spam if its LDA prediction is above a certain threshold.

3. EXPERIMENTS

The data set we used is the UK2007-WEBSPAM corpus. We kept only the labeled sites, with 203 labeled as spam and 3589 as non-spam. We aggregated the words and meta keywords appearing in all pages of a site to form one document per site in a bag of words model (only multiplicity and no order information is used). We kept only alphanumeric characters and the hyphen, but removed all words containing a hyphen not between two alphabetical words. We deleted all stop words enumerated in the list at http://www.lextek.com/manuals/onix/stopwords1.html, and used the TreeTagger (http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/) for stemming. After this procedure the most frequent 22,000 words formed the vocabulary.

We used Phan's GibbsLDA++ C++ code [19] to run LDA, and a modified version of it to run the multi-corpus inference for unseen documents in multi-corpus LDA. We applied 5-fold cross validation. Two LDAs were run on the training spam and non-spam corpora, and then multi-corpus inference was made for the test documents by Gibbs sampling as in (5). The Dirichlet parameter β was chosen to be a constant 0.1 throughout, while α^{(s)} = 50/k^{(s)}, α^{(n)} = 50/k^{(n)}, and during multi-corpus inference α was a constant 50/(k^{(s)} + k^{(n)}) (these are the default values in GibbsLDA++). We stopped Gibbs sampling after 2000 steps for inference on the training data, and after 1000 steps for the multi-corpus inference for an unseen document.

The numbers of topics were chosen to be k^{(s)} = 2, 5, 10, 20 and k^{(n)} = 10, 20, 50. Consequently, we performed altogether 12 runs, and thus obtained 12 one-dimensional LDA predictions as features. By observing the F-measure curves shown in Figure 2 we selected the three best performing parameter choices. F-measures and ROC values are shown in Table 1. The best result corresponds to the choice k^{(s)} = 10 and k^{(n)} = 50, with an F-measure of 0.46.

[Figure 2: F-measure curves with varying thresholds (horizontal axis), and precision-recall curves of the three best LDA features (spam:5/nonspam:50, spam:10/nonspam:50, spam:5/nonspam:10).]

Figure 2 indicates that the multi-corpus method is robust to the choice of topic numbers, as the performance does not change much when the topic numbers are varied. As one can expect, the maximum of such an F-measure curve is approximately at k^{(s)}/k^{(n)}.

Table 1: F-measures and ROC values of the three best performing LDA predictions.

    pair of topic numbers | F     | ROC
    5/50                  | 0.451 | 0.855
    10/50                 | 0.458 | 0.861
    5/10                  | 0.458 | 0.868

We combined the single best performing k^{(s)} = 10, k^{(n)} = 50 LDA prediction with an SVM classifier over the tf.idf features and a C4.5 classifier over the public and the sonar features, using logistic regression. All methods were run with the machine learning toolkit Weka [23]. The F-measures and ROC values are shown in Table 2. LDA improved a relative 11% over the F-measure and 1.5% over the ROC of the remaining combined features.

Table 2: F/ROC values.

    feature set                        | F     | ROC
    text (SVM)                         | 0.554 | 0.864
    public & text & sonar (log)        | 0.601 | 0.954
    public & text & sonar & lda (log)  | 0.667 | 0.969
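The aggregation step treats each base classifier's output as one input feature of a logistic regression. The authors used Weka for this; the sketch below is merely an equivalent illustration with scikit-learn, and the feature arrays and function name are hypothetical.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # base_scores: per-host outputs of the baseline classifiers
    # (e.g. SVM over tf.idf, C4.5 over public + sonar features), one column each.
    # lda_pred: the per-host LDA prediction; y: spam labels (1 = spam).
    def combine(base_scores, lda_pred, y):
        X = np.column_stack([base_scores, lda_pred])    # stack classifier outputs as features
        model = LogisticRegression(max_iter=1000).fit(X, y)
        return model    # model.predict_proba(X)[:, 1] gives the combined spam probability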
We also performed single-feature classification for the three best performing LDA predictions over the UK2006-WEBSPAM corpus. Here we have 2125 sites labeled as spam, and 8082 labeled as non-spam. The parameters and the setup were the same as above. The results can be seen in Table 3.

Table 3: F-measures and ROC values for UK2006-WEBSPAM.

    pair of topic numbers | F     | ROC
    5/50                  | 0.704 | 0.881
    10/50                 | 0.735 | 0.902
    5/10                  | 0.723 | 0.906

Conclusion and future work

We presented a novel multi-corpus LDA technique that resulted in a relative improvement of about 10% over a strong content and link feature baseline. Although apparently the UK2007-WEBSPAM data is much more sensitive to content than to link features, we reached an improvement over the UK2006-WEBSPAM data as well. We believe that, similarly to the success of email spam filtering methods [7], applying semantic analysis to spam filtering is a promising direction. In future work we plan to implement the email content filtering methods of [7] and test their combination with LDA.

4. REFERENCES

[1] J. Abernethy, O. Chapelle, and C. Castillo. WITCH: A new approach to web spam detection. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2008.
[2] E. Amitay, D. Carmel, A. Darlow, R. Lempel, and A. Soffer. The Connectivity Sonar: Detecting site functionality by structural patterns. In Proceedings of the 14th ACM Conference on Hypertext and Hypermedia (HT), pages 38–47, Nottingham, United Kingdom, 2003.
[3] I. Bhattacharya and L. Getoor. A latent Dirichlet model for unsupervised entity resolution. In SIAM International Conference on Data Mining, 2006.
[4] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(5):993–1022, 2003.
[5] A. Bratko, B. Filipič, G. Cormack, T. Lynam, and B. Zupan. Spam filtering using statistical data compression models. Journal of Machine Learning Research, 7:2673–2698, 2006.
[6] C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: Web spam detection using the web topology. Technical report, DELIS – Dynamically Evolving, Large-Scale Information Systems, 2006.
[7] G. Cormack. Content-based web spam detection. In Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2007.
[8] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In Proc. CVPR, volume 5, 2005.
[9] D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics – using statistical analysis to locate spam web pages. In Proceedings of the 7th International Workshop on the Web and Databases (WebDB), pages 1–6, Paris, France, 2004.
[10] D. Fetterly, M. Manasse, and M. Najork. Detecting phrase-level duplication on the world wide web. In Proceedings of the 28th ACM International Conference on Research and Development in Information Retrieval (SIGIR), Salvador, Brazil, 2005.
[11] T. Griffiths. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl. 1):5228–5235, 2004.
[12] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Chiba, Japan, 2005.
[13] G. Heinrich. Parameter estimation for text analysis. Technical report, 2004.
[14] M. R. Henzinger, R. Motwani, and C. Silverstein. Challenges in web search engines. SIGIR Forum, 36(2):11–22, 2002.
[15] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50–57, 1999.
[16] T. Lynam, G. Cormack, and D. Cheriton. On-line spam filter fusion. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 123–130, 2006.
[17] T. Minka and J. Lafferty. Expectation-propagation for the generative aspect model. In Uncertainty in Artificial Intelligence (UAI), 2002.
[18] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In Proceedings of the 15th International World Wide Web Conference (WWW), pages 83–92, Edinburgh, Scotland, 2006.
[19] X.-H. Phan. GibbsLDA++. http://gibbslda.sourceforge.net/.
[20] A. Singhal. Challenges in running a commercial search engine. In IBM Search and Collaboration Seminar 2004. IBM Haifa Labs, 2004.
[21] A. Singhal, G. Salton, M. Mitra, and C. Buckley. Document length normalization. Information Processing and Management, 32(5):619–633, 1996.
[22] J. Sivic, B. Russell, A. Efros, A. Zisserman, and W. Freeman. Discovering objects and their localization in images. In Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV), volume 1, 2005.
[23] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, second edition, June 2005.
[24] D. Xing and M. Girolami. Employing latent Dirichlet allocation for fraud detection in telecommunications. Pattern Recognition Letters, 28(13):1727–1734, 2007.