Abstract
We show that eigenvector decomposition can be used to extract a term taxonomy from a given collection of text documents. Until now, methods based on eigenvector decomposition, such as latent semantic indexing (LSI) or principal component analysis (PCA), have been known to be useful only for extracting symmetric relations between terms. We give a precise mathematical criterion for distinguishing between four kinds of relations between a pair of terms in a given collection: unrelated (car – fruit), symmetrically related (car – automobile), asymmetrically related with the first term being more specific than the second (banana – fruit), and asymmetrically related in the other direction (fruit – banana). We give theoretical evidence for the soundness of our criterion by showing that, in a simplified mathematical model, it yields the intuitively correct classification. We applied our scheme to the reconstruction of a selected part of the Open Directory Project (ODP) hierarchy, with promising results.
© 2006 Springer-Verlag Berlin Heidelberg
Cite this paper
Bast, H., Dupret, G., Majumdar, D., Piwowarski, B. (2006). Discovering a Term Taxonomy from Term Similarities Using Principal Component Analysis. In: Ackermann, M., et al. Semantics, Web and Mining. EWMF 2005, KDO 2005. Lecture Notes in Computer Science, vol 4289. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11908678_7
DOI: https://doi.org/10.1007/11908678_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-47697-9
Online ISBN: 978-3-540-47698-6