A Comparative Analysis of Latent Variable Models for Web Page Classification

István Bíró, András Benczúr, Jácint Szabó
Data Mining and Web Search Research Group, Informatics Laboratory
Computer and Automation Research Institute, Hungarian Academy of Sciences
Budapest, Hungary
{ibiro,benczur,jacint}@ilab.sztaki.hu

Ana Maguitman
Grupo de Investigación en Recuperación de Información y Gestión del Conocimiento
Departamento de Cs. e Ing. de la Computación
Universidad Nacional del Sur - CONICET, Bahía Blanca, Argentina
agm@cs.uns.edu.ar

Abstract

A main challenge for Web content classification is how to model the input data. This paper discusses the application of two text modeling approaches, Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA), to the Web page classification task. We report results on a comparison of these two approaches using different vocabularies consisting of links and text. Both models are evaluated using different numbers of latent topics. Finally, we evaluate a hybrid latent variable model that combines the latent topics resulting from both LSA and LDA. This new approach turns out to be superior to the basic LSA and LDA models. In our experiments with categories and pages obtained from the ODP web directory, the hybrid model achieves an averaged F-measure of 0.852 and an averaged ROC value of 0.96.

1. Introduction

Web content classification is expected to help with the automatic generation of topic directories, community detection, advertisement blocking and web spam filtering, among other applications. A main problem for classification is how to model the input data. Modeling text with a relatively small set of representative topics not only helps alleviate the effect of the curse of dimensionality, but can also avoid the false-negative match problem that arises when documents with similar topics but different term vocabularies cannot be associated.

A corpus of web pages can be characterized by the individual words and structure of each particular page (intra-document structure), through labeled hyperlinks or recurrent words relating one page to another (inter-document structure), and by the semantic relations between words, which define the concept- or topic-space. While traditional feature selection schemes [14] have some appealing characteristics, they are deficient in revealing the inter- or intra-document statistical structure of the corpus. To address these limitations, other dimensionality reduction techniques have been proposed, including Latent Semantic Analysis (LSA) [6] and Probabilistic Latent Semantic Analysis (PLSA) [13]. These approaches can achieve a significant reduction of the feature-space dimensionality and have found successful application not only in tasks where the huge number of features would have made it impossible to process the dataset further, but also in applications where documents are to be compared in concept- or topic-space.

A successful text modeling approach is Latent Dirichlet Allocation (LDA), developed by Blei, Ng and Jordan [2]. LDA models every topic as a distribution over the terms of the vocabulary, and every document as a distribution over the topics. These distributions are sampled from Dirichlet distributions.

This paper presents an evaluation of LSA and LDA as text modeling approaches for the task of supervised text categorization.
We compare the performance of both methods for Web page classification and discuss the benefits of integrating both approaches to capture various aspects of the modeled material.

2. Background

An important question that arises when implementing a Web page classifier is how to model each page. The goal is to obtain compact representations that are sufficiently descriptive and discriminative of the topics associated with the pages. Three important questions that need to be answered when designing a Web page classifier are (1) what features should be extracted from the pages, (2) how to use these features to model the content of each page, and (3) what algorithm should be used to predict a page's category.

Web page classification is significantly different from traditional text classification because of the presence of links. While "text-only" approaches often provide a good indication of relatedness, the incorporation of link signals can considerably improve methods for grouping similar resources [24]. However, running a topic prediction algorithm directly on the text and links of the pages has some limitations. The main problem is that the underlying topics that could lead to a semantic representation of the pages remain hidden. Therefore, even semantically similar pages can have low similarity if they do not share a significant number of terms and links.

To address these issues, modeling approaches that attempt to uncover the hidden structure of a document collection have been proposed. Sections 2.1 and 2.2 present an overview of LSA and LDA, two text modeling techniques in which documents are associated with latent topics. Documents can then be compared by means of their topics, and therefore documents can have high similarity even if they do not share any features, as long as these features are related in a sense to be described in the next two sections.

Contrary to the usual setup, links are not used to propagate topics; instead we treat them as words and build latent topic models on them. Although links are not words of a natural language, and so one cannot take it for granted that applying latent topic models to them will work, the results of this paper justify the use of such models. LDA with a vocabulary consisting of links was first considered by Zhang, Qiu, Giles, Foley, and Yen [26]. Their model, called SSN-LDA, is exactly our LDA model on links, applied in a social network context. Aside from this paper, we are not aware of any results on latent topic models built on links.

2.1. Latent Semantic Analysis

LSA is a theory and method for extracting and representing the contextual usage meaning of words by statistical computations applied to a large corpus [6]. The method requires a corpus of documents from any domain, and the vector space model [22] is used to represent this corpus. In this model a document is represented as a vector where each dimension corresponds to a separate feature of the document. A feature could be a term or any other unit that is a representative attribute of the documents in the given corpus. If a feature occurs in the document, its value in the vector is non-zero. A common way of computing these values is the tf-idf weighting scheme [21].

An important step in LSA is to transform the document-feature vector space into a document-concept and concept-document vector space. By reducing the number of concepts, the documents and their features are projected into a lower-dimensional concept space. As a consequence, new and previously latent relations arise between documents and features. In order to apply LSA we first generate a document-feature matrix M from the given corpus. Then, singular value decomposition (SVD) is applied to M, resulting in matrices U, S, and V. The SVD decomposition is such that M = USV^T, U and V have orthogonal columns (i.e., U^T U = I and V^T V = I), and S has entries only along the diagonal. The next step is to reduce the dimension of S to a significantly lower value k, obtaining a matrix S'. The same reduction is performed on the column dimension of U and the row dimension of V^T, resulting in the lower-rank matrices U' and V'^T. By multiplying the matrices U', S' and V'^T we obtain a new matrix M' that relates documents and their features through a concept space.

The matrix M' can be thought of as a low-rank approximation to the document-feature matrix M. This reduction is sometimes necessary when the original document-feature matrix is presumed too large for the computing resources or when it is considered noisy. Most commonly, this low-rank approximation is performed to represent documents in concept space. A problem with this approach is that the resulting dimensions might be difficult to interpret. LSA also assumes that documents and features form a joint Gaussian model, while a Poisson distribution is typically observed. To overcome some of these problems, Hofmann [13] introduced Probabilistic LSA (PLSA), a generative graphical model that enhances latent semantic analysis with a sounder probabilistic model. Although PLSA gave promising results, it suffers from two limitations: the number of parameters is linear in the number of documents, and it is not possible to make inference for unseen data. In this paper only LSA is applied, not PLSA.
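For illustration, the sketch below builds a tf-idf document-feature matrix and reduces it to a k-dimensional concept space with a truncated SVD. It uses scikit-learn rather than the pivoted tf-idf weighting and the SVDPACK Lanczos code used later in this paper; the toy corpus, the value of k and the variable names are placeholders.

```python
# Minimal LSA sketch: tf-idf document-feature matrix M, truncated SVD,
# and documents represented in a k-dimensional concept space.
# Illustrative only; not the pivoted tf-idf / SVDPACK setup of the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "latent semantic analysis maps documents to a concept space",
    "latent dirichlet allocation is a generative topic model",
    "web pages link to other web pages",
]

vectorizer = TfidfVectorizer()         # builds the document-feature matrix M
M = vectorizer.fit_transform(docs)     # shape: (n_documents, |vocabulary|)

k = 2                                  # number of retained concepts (15 or 30 in the paper)
svd = TruncatedSVD(n_components=k)     # rank-k approximation M' = U' S' V'^T
doc_concept = svd.fit_transform(M)     # rows: documents in concept space

# Unseen documents are folded into the same concept space:
new_doc = vectorizer.transform(["topic models for web page classification"])
new_concept = svd.transform(new_doc)
print(doc_concept.shape, new_concept.shape)
```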
2.2. Latent Dirichlet Allocation

In this section we present a short overview of LDA [2]; for a detailed elaboration, we refer the reader to [12]. The LDA method takes a vocabulary V consisting of features, a set T of k topics, and n documents of arbitrary length. For every topic z a distribution ϕ_z on V is sampled from Dir(β), where β ∈ R_+^V is a smoothing parameter. Similarly, for every document d a distribution ϑ_d on T is sampled from Dir(α), where α ∈ R_+^T is a smoothing parameter. The words of the documents are drawn as follows: for every word-position of document d a topic z is drawn from ϑ_d, and then a term (or other useful feature) is drawn from ϕ_z and filled into the position. LDA can be thought of as a Bayesian network, see Figure 1.

[Figure 1. LDA as a Bayesian network: ϕ ∼ Dir(β) for each of the k topics; ϑ ∼ Dir(α) for each of the n documents; z ∼ ϑ and w ∼ ϕ_z for each word-position.]

One method for finding the LDA model via inference is Gibbs sampling [9] (additional methods are variational expectation maximization [2] and expectation propagation [16]). Gibbs sampling is a Markov chain Monte Carlo algorithm for sampling from a joint distribution p(x), x ∈ R^n, provided that all conditional distributions p(x_i | x_{-i}) are known, where x_{-i} = (x_1, ..., x_{i-1}, x_{i+1}, ..., x_n). In LDA the goal is to estimate the distribution p(z|w) for z ∈ T^P, w ∈ V^P, where P denotes the set of word-positions in the documents. Thus in the Gibbs sampling one has to calculate p(z_i | z_{-i}, w) for i ∈ P. This has an efficiently computable closed form (for a deduction, see [12]):

    p(z_i | z_{-i}, w) = \frac{n_{z_i}^{t_i} - 1 + \beta_{t_i}}{n_{z_i} - 1 + \sum_t \beta_t} \cdot \frac{n_d^{z_i} - 1 + \alpha_{z_i}}{n_d - 1 + \sum_z \alpha_z}.   (1)

Here d is the document of position i, t_i is the actual word in position i, n_{z_i}^{t_i} is the number of positions with topic z_i and term t_i, n_{z_i} is the number of positions with topic z_i, n_d^{z_i} is the number of positions with topic z_i in document d, and n_d is the length of document d. After a sufficient number of iterations we arrive at a topic assignment sample z. Knowing z, ϕ and ϑ are estimated as

    \varphi_{z,t} = \frac{n_z^t + \beta_t}{n_z + \sum_t \beta_t}   (2)

and

    \vartheta_{d,z} = \frac{n_d^z + \alpha_z}{n_d + \sum_z \alpha_z}.   (3)

We call the above method model inference. After the model is built, we make unseen inference for every new, unseen document d. The ϑ topic-distribution of d can be estimated exactly as in (3) once we have a sample from its word-topic assignment z. Sampling z can be performed with a method similar to the one above, but now only for the positions i in d:

    p(z_i | z_{-i}, w) = \frac{\tilde{n}_{z_i}^{t_i} - 1 + \beta_{t_i}}{\tilde{n}_{z_i} - 1 + \sum_t \beta_t} \cdot \frac{n_d^{z_i} - 1 + \alpha_{z_i}}{n_d - 1 + \sum_z \alpha_z}.   (4)

The notation ñ refers to counts taken over the union of the whole corpus and document d.
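As an illustration of the model inference described by Equations (1)-(3), the following is a compact collapsed Gibbs sampler sketch with symmetric α and β over a toy corpus of term ids. It is not the GibbsLDA++ implementation used in our experiments; the corpus, hyperparameter values and names are placeholders, and the −1 terms of Equation (1) are realized by decrementing the counts of the current position before sampling.

```python
# Collapsed Gibbs sampling for LDA -- a didactic sketch of the update in Eq. (1).
# Symmetric alpha/beta, toy corpus; not the GibbsLDA++ code used in the paper.
import numpy as np

def lda_gibbs(docs, V, K, alpha=0.5, beta=0.1, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    n_zt = np.zeros((K, V))            # topic-term counts  n_z^t
    n_dz = np.zeros((len(docs), K))    # document-topic counts  n_d^z
    n_z = np.zeros(K)                  # topic counts  n_z
    z = [rng.integers(K, size=len(d)) for d in docs]   # random initial assignments

    for d, doc in enumerate(docs):     # add initial assignments to the counts
        for i, t in enumerate(doc):
            n_zt[z[d][i], t] += 1
            n_dz[d, z[d][i]] += 1
            n_z[z[d][i]] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, t in enumerate(doc):
                k = z[d][i]
                n_zt[k, t] -= 1; n_dz[d, k] -= 1; n_z[k] -= 1   # exclude position i
                # p(z_i | z_-i, w): term factor times document-topic factor, Eq. (1)
                p = (n_zt[:, t] + beta) / (n_z + V * beta) * (n_dz[d] + alpha)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                n_zt[k, t] += 1; n_dz[d, k] += 1; n_z[k] += 1

    phi = (n_zt + beta) / (n_z[:, None] + V * beta)                    # Eq. (2)
    theta = (n_dz + alpha) / (n_dz.sum(1, keepdims=True) + K * alpha)  # Eq. (3)
    return phi, theta

# Toy corpus: documents are lists of term ids over a vocabulary of size V.
docs = [[0, 1, 2, 1], [3, 4, 3, 5], [0, 2, 4, 5]]
phi, theta = lda_gibbs(docs, V=6, K=2)
print(theta.round(2))
```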
3. Comparing and Integrating LSA and LDA in Web Page Classification

Our goal is to compare LSA and LDA as text modeling approaches in the Web page classification task. For that purpose, we run the series of tests described next.

3.1. Input Data

In order to run our comparison tests we used a subset of the Open Directory Project (ODP, http://dmoz.org) corpus, from which we chose the 8 top-level categories Arts, Business, Computers, Health, Science, Shopping, Society, and Sports. The resulting corpus contained more than 350K documents distributed among these eight categories. We randomly split this collection of pages into training (80%) and test (20%) collections.

Text and links were extracted from the collected pages and used to generate two vocabulary sets. We refer to the vocabulary set based on text as T and to the vocabulary set based on links as L. In order to generate T we extracted all terms from the selected Web pages and kept only the 30K terms that occurred with the highest frequency. This vocabulary was filtered further by eliminating stop-words (using the stop-word list available at http://www.lextek.com/manuals/onix/stopwords1.html) and keeping only terms consisting of alphanumeric characters, including those containing a hyphen or an apostrophe. Documents with fewer than 1000 terms were discarded. Finally, the text was run through the TreeTagger software (http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/). The final number of words in T was 21308.

The vocabulary set L combines the incoming links and outgoing links associated with our corpus, that is, the vocabulary consists of web pages linking to or linked by a page in our corpus. Incoming links were obtained using the Google Web API. To avoid circularity in our classification tests, we took special care to filter out those links coming from topical directories. Finally, we extracted all the outgoing links from the pages and added them to L. Links that occur in fewer than 10 documents were removed from our vocabulary, and we only kept (training and test) documents with at least 3 links. The size of vocabulary L was 44561.

It is important to notice that distinct portions of the training and test data are kept for the link and text vocabularies. By intersecting these two test sets we obtain a common test set with 1218 pages.
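To make the vocabulary construction above concrete, the sketch below mirrors the main filtering steps for the text vocabulary T (alphanumeric tokens possibly containing hyphens or apostrophes, stop-word removal, a top-30K frequency cut-off, a minimum document length) and for the link vocabulary L (minimum document frequency of 10, at least 3 links per document). The tokenizer, the stop-word list and the function names are illustrative placeholders, and the TreeTagger lemmatization step is omitted.

```python
# Sketch of building the text vocabulary T and the link vocabulary L.
# Tokenizer and stop-word list are placeholders; TreeTagger lemmatization
# from the paper is not reproduced here.
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9]+(?:[-'][a-z0-9]+)*")  # alphanumeric, hyphen, apostrophe
STOPWORDS = {"the", "and", "of", "to", "a"}             # placeholder stop-word list

def tokenize(text):
    return [t for t in TOKEN_RE.findall(text.lower()) if t not in STOPWORDS]

def build_text_vocabulary(pages, max_terms=30000, min_doc_len=1000):
    docs = [tokenize(p) for p in pages]
    docs = [d for d in docs if len(d) >= min_doc_len]    # discard short documents
    freq = Counter(t for d in docs for t in d)
    vocab = {t for t, _ in freq.most_common(max_terms)}  # keep most frequent terms
    docs = [[t for t in d if t in vocab] for d in docs]
    return vocab, docs

def build_link_vocabulary(link_docs, min_df=10, min_links=3):
    df = Counter(l for d in link_docs for l in set(d))
    vocab = {l for l, c in df.items() if c >= min_df}    # links in >= 10 documents
    docs = [[l for l in d if l in vocab] for d in link_docs]
    docs = [d for d in docs if len(d) >= min_links]      # keep pages with >= 3 links
    return vocab, docs
```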
3.2. Experiments

Using the training collection we created models for LSA and LDA based on different vocabularies and using different numbers of latent topics. By taking this approach, we were able to compare the performance of our models across different dimensions. For LSA we used the pivoted tf-idf weighting scheme [23] and applied SVD with dimensions 5, 15 and 30. LDA was run for 1000 iterations to generate 5, 15 and 30 latent topics.

In what follows we use a simple convention to name the tested models. For example, an LSA model that uses the text vocabulary and 15 latent topics is referred to as LSA-T-15, while one that uses the link vocabulary for the same number of topics is named LSA-L-15. In addition, one can explore alternative text and link combinations, as well as the integration of latent topics from LSA and LDA. For example, by combining LDA-T-15 with LDA-L-30 we obtained the LDA-T-15-L-30 model, and by combining LDA-L-15-T-30 with LSA-L-15-T-30 we obtained LDA-L-15-T-30-LSA-L-15-T-30.

In order to generate the LSA models we used the Lanczos code of SVDPACK [1] to run SVD. For LDA we used Phan's GibbsLDA++ C++ code [19]. Once our models were generated, we used the Weka machine learning toolkit [25] with 10-fold cross-validation to run two different binary classifiers, C4.5 and SVM, separately for every category. The F-measures and ROC values presented in the rest of the paper are averaged over the 8 categories.

4. Results

In this section we report the averaged F-measures and ROC values for the tested models. Although we generated models with 5, 15 and 30 latent topics, we omit the results for the models with 5 topics, given that they performed poorly. Tables 1 and 2 compare the performance of the LSA and LDA models using different text and link vocabulary combinations (FMR stands for F-measure).

Table 1. Averaged FMR/ROC values with LSA

Experiment    SVM            C4.5
L-15          0.105/0.514    0.198/0.575
L-30          0.136/0.532    0.433/0.732
T-15          0.531/0.824    0.487/0.722
T-30          0.562/0.839    0.446/0.687
L-15-T-15     0.666/0.881    0.558/0.753
L-15-T-30     0.710/0.894    0.561/0.755
L-30-T-15     0.671/0.882    0.594/0.783
L-30-T-30     0.708/0.893    0.579/0.768

Table 2. Averaged FMR/ROC values with LDA

Experiment    SVM            C4.5
L-15          0.249/0.724    0.385/0.738
L-30          0.367/0.758    0.458/0.761
T-15          0.464/0.834    0.435/0.686
T-30          0.619/0.876    0.453/0.710
L-15-T-15     0.699/0.900    0.604/0.787
L-15-T-30     0.765/0.938    0.571/0.756
L-30-T-15     0.687/0.896    0.594/0.771
L-30-T-30     0.757/0.921    0.575/0.767

We can observe that for both models the text vocabulary is superior to the link vocabulary. This is not surprising considering that the number of links associated with each page is usually much smaller than the number of terms. However, we observe that the models that combine text and link features are appreciably superior to those based on text only. The most effective model for LSA is LSA-L-15-T-30, which combines 15 link-based topics with 30 text-based topics. Similarly, the best LDA model is LDA-L-15-T-30. Table 3 summarizes the improvements obtained when link features are included in the vocabulary; the classifier here is SVM. We also observe that LDA was superior to LSA for all the tested combinations.

Table 3. Comparison of text (T) and link (L) based classification results

Method           F-measure   ROC
LSA-T-30         0.562       0.839
LSA-L-15-T-30    0.710       0.894
Improvement      26.3%       6.6%
LDA-T-30         0.619       0.876
LDA-L-15-T-30    0.765       0.938
Improvement      23.6%       7.1%

Table 4 shows the averaged FMR and ROC values for the best LSA/LDA configurations, using SVM.

Table 4. Averaged FMR & ROC values for the best LSA/LDA configuration

Method           F-measure   ROC
LSA-L-15-T-30    0.710       0.894
LDA-L-15-T-30    0.765       0.938
Improvement      7.7%        4.9%

Finally, we looked into the combination of the two latent variable models. Interestingly, by combining the best configurations of LDA and LSA we obtain the best model, with an averaged FMR value of 0.852 and an averaged ROC value of 0.96. Table 5 summarizes these results.

Table 5. Averaged FMR & ROC values for the best LSA, LDA, and LSA-LDA combined configuration

Method                             F-measure   ROC
LSA-L-15-T-30                      0.710       0.894
LDA-L-15-T-30                      0.765       0.938
LSA-L-15-T-30-LDA-L-15-T-30        0.852       0.96
Improvement of LDA-LSA over LSA    20%         7.4%
Improvement of LDA-LSA over LDA    11.3%       2.3%
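The combined model above amounts to concatenating the per-document latent representations produced by the individual models before training the classifier. The sketch below illustrates this idea with scikit-learn in place of the Weka C4.5/SVM setup used in the paper; the feature matrices and labels are random placeholders standing in for the LSA and LDA topic vectors of the common test set.

```python
# Sketch of combining latent representations (e.g. LSA-L-15-T-30 with
# LDA-L-15-T-30) by concatenating per-document vectors before classification.
# scikit-learn stands in for Weka; the arrays below are placeholders only.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

n_docs = 1218                                   # size of the common test set
rng = np.random.default_rng(0)
lsa_feats = rng.normal(size=(n_docs, 15 + 30))  # stand-in for LSA-L-15-T-30 vectors
lda_feats = rng.random(size=(n_docs, 15 + 30))  # stand-in for LDA-L-15-T-30 vectors
labels = rng.integers(2, size=n_docs)           # binary: page in the category or not

X = np.hstack([lsa_feats, lda_feats])           # combined LSA+LDA representation
scores = cross_val_score(LinearSVC(), X, labels, cv=10, scoring="f1")
print("10-fold F-measure: %.3f" % scores.mean())
```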
The improvement registered by integrating both models points to the fact that the LDA and LSA models capture different aspects of the hidden structure of the corpus, and that the combination of these models can be highly beneficial.

5. Discussion and Conclusions

It has long been recognized that text and link features extracted from pages can help discover Web communities, which often lead to the extraction of topically coherent subgraphs useful for clustering or classifying Web pages. Many algorithms based solely on link information have been proposed to partition hypertext environments [11, 3, 20], to identify and examine the structure of topics on the Web [8, 7, 4], and for Web content categorization [10]. Other algorithms use the hyperlink structure of the Web to find related pages [15, 5]. An approach, totally different from ours, that combines latent variable models to identify topics on the Web is Link-PLSA-LDA [18].

LSA and LDA are based on different principles. On the one hand, LSA assumes that words and documents can be represented as points in Euclidean space. On the other hand, LDA (like other statistical models) assumes that the semantic properties of words and documents are expressed in terms of probabilistic topics. Although some recent theoretical work has been carried out comparing Euclidean and probabilistic latent variable models (e.g., [17]), to the best of our knowledge this is the first attempt to provide a thorough empirical comparison of the two modeling approaches in the Web page classification task. In our evaluations we observe that although LDA is superior to LSA for all the tested configurations, the improvements achieved by combining latent structures from both approaches are noteworthy. Despite the different underlying assumptions of these two approaches and the seeming superiority of LDA, each one appears to have something unique to contribute when modeling text and links for classification.

6. Acknowledgements

We wish to thank the anonymous reviewers for helpful comments and suggestions. This research work is partially supported by an international cooperation project funded by NKTH and MinCyT.

References

[1] M. Berry. SVDPACK: A Fortran-77 Software Library for the Sparse Singular Value Decomposition. 1992.
[2] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(5):993–1022, 2003.
[3] R. A. Botafogo and B. Shneiderman. Identifying aggregates in hypertext structures. In Proceedings of the Third Annual ACM Conference on Hypertext, pages 63–74. ACM Press, 1991.
[4] S. Chakrabarti, M. M. Joshi, K. Punera, and D. M. Pennock. The structure of broad topics on the Web. In Proceedings of the Eleventh International Conference on World Wide Web, pages 251–262. ACM Press, 2002.
[5] J. Dean and M. R. Henzinger. Finding related pages in the World Wide Web. Computer Networks (Amsterdam, Netherlands: 1999), 31(11–16):1467–1479, 1999.
[6] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
[7] S. Dill, S. R. Kumar, K. S. McCurley, S. Rajagopalan, D. Sivakumar, and A. Tomkins. Self-similarity in the Web. In The VLDB Journal, pages 69–78, 2001.
[8] D. Gibson, J. M. Kleinberg, and P. Raghavan. Inferring Web communities from link topology. In UK Conference on Hypertext, pages 225–234, 1998.
[9] T. Griffiths. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235, 2004.
[10] Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Web content categorization using link information. Technical report, Stanford University, 2007.
[11] Y. Hara and Y. Kasahara. A set-to-set linking strategy for hypertext systems. In Proceedings of the Conference on Office Information Systems, pages 131–135. ACM Press, 1990.
[12] G. Heinrich. Parameter estimation for text analysis. Technical report, 2004.
[13] T. Hofmann. Probabilistic latent semantic indexing. In SIGIR ’99: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50–57, New York, NY, USA, 1999. ACM.
[14] H. Liu and H. Motoda. Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic Publishers, Norwell, MA, USA, 1998.
[15] M. Marchiori. The quest for correct information on the Web: hyper search engines. In Selected Papers from the Sixth International Conference on World Wide Web, pages 1225–1235. Elsevier Science Publishers Ltd., 1997.
[16] T. Minka and J. Lafferty. Expectation-propagation for the generative aspect model. In Uncertainty in Artificial Intelligence (UAI), 2002.
[17] M. Steyvers and T. Griffiths. Probabilistic topic models. In T. Landauer, D. S. McNamara, S. Dennis, and W. Kintsch, editors, Handbook of Latent Semantic Analysis. Erlbaum, Mahwah, NJ, 2007.
[18] R. Nallapati and W. Cohen. Link-PLSA-LDA: A new unsupervised model for topics and influence in blogs. In International Conference for Weblogs and Social Media, 2008.
[19] X.-H. Phan. GibbsLDA++. http://gibslda.sourceforge.net/.
[20] J. Pitkow and P. Pirolli. Life, death, and lawfulness on the electronic frontier. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 383–390. ACM Press, 1997.
[21] G. Salton and C. Buckley. Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523, 1988.
[22] G. Salton, C. Yang, and C. Yu. A theory of term importance in automatic text analysis. Journal of the American Society for Information Science, 36:33–44, 1975.
[23] A. Singhal, C. Buckley, and M. Mitra. Pivoted document length normalization. In SIGIR ’96: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 21–29, New York, NY, USA, 1996. ACM.
[24] R. Weiss, B. Vélez, and M. A. Sheldon. HyPursuit: a hierarchical network search engine that exploits content-link hypertext clustering. In Proceedings of the Seventh ACM Conference on Hypertext, pages 180–193. ACM Press, 1996.
[25] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition. Morgan Kaufmann, San Francisco, 2005.
[26] H. Zhang, B. Qiu, C. Giles, H. Foley, and J. Yen. An LDA-based community structure discovery approach for large-scale social networks. In Intelligence and Security Informatics, 2007 IEEE, pages 200–207, 2007.