Discovering a Term Taxonomy from Term Similarities Using Principal Component Analysis

Bast, Holger; Dupret, Georges; Majumdar, Debapriyo; Piwowarski, Benjamin

doi:10.1007/11908678_7


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4289)


Abstract

We show that eigenvector decomposition can be used to extract a term taxonomy from a given collection of text documents. So far, methods based on eigenvector decomposition, such as latent semantic indexing (LSI) or principal component analysis (PCA), were known to be useful only for extracting symmetric relations between terms. We give a precise mathematical criterion for distinguishing between four kinds of relations for a given pair of terms in a given collection: unrelated (car – fruit), symmetrically related (car – automobile), asymmetrically related with the first term being more specific than the second (banana – fruit), and asymmetrically related in the other direction (fruit – banana). We give theoretical evidence for the soundness of our criterion by showing that, in a simplified mathematical model, the criterion gives the apparently correct answer. We applied our scheme to the reconstruction of a selected part of the Open Directory Project (ODP) hierarchy, with promising results.
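
As a point of reference for the symmetric case mentioned in the abstract, the following is a minimal sketch of how term-term similarities can be read off a truncated eigenvector decomposition. It is not the asymmetric criterion of the paper, which the abstract does not spell out. The toy term-document counts, the function name `rank_k_similarity`, and the choice of cosine similarity are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy data: rows are terms, columns are documents (made-up counts).
terms = ["car", "automobile", "banana", "fruit"]
A = np.array([
    [1, 1, 1, 0, 0, 0, 0],  # car
    [0, 1, 1, 1, 0, 0, 0],  # automobile
    [0, 0, 0, 0, 1, 1, 0],  # banana
    [0, 0, 0, 0, 1, 1, 1],  # fruit
], dtype=float)


def rank_k_similarity(term_doc, k):
    """Cosine similarity between terms in the space of the top-k eigenvectors."""
    cooc = term_doc @ term_doc.T        # symmetric term-term co-occurrence matrix
    vals, vecs = np.linalg.eigh(cooc)   # eigenvalues come back in ascending order
    top = np.argsort(vals)[::-1][:k]    # indices of the k largest eigenvalues
    # Scale each eigenvector by the square root of its eigenvalue (LSI-style term vectors).
    coords = vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))
    norms = np.linalg.norm(coords, axis=1)
    safe = np.where(norms > 1e-12, norms, 1.0)  # avoid dividing by zero for empty rows
    unit = coords / safe[:, None]
    return unit @ unit.T


sim = rank_k_similarity(A, k=2)
for i in range(len(terms)):
    for j in range(i + 1, len(terms)):
        print(f"{terms[i]:>10} - {terms[j]:<10} {sim[i, j]: .3f}")
```

The resulting matrix is symmetric by construction, so it can indicate that car and automobile are related, but it cannot by itself say which of banana and fruit is the more specific term. That direction is what the criterion described in the abstract is meant to add.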



Author information

Authors and Affiliations

Max-Planck-Institut für Informatik, Saarbrücken
Holger Bast & Debapriyo Majumdar
Yahoo! Research Latin America
Georges Dupret & Benjamin Piwowarski

Editor information

Editors and Affiliations

Department of Natural Language Processing, Institute for Computer Science, University of Leipzig
Markus Ackermann
Department of Computer Science, K.U. Leuven, B-3001, Heverlee, Belgium
Bettina Berendt
Dept. of Knowledge Technologies, Jozef Stefan Institute, Jamova 39, 1000, Ljubljana, Slovenia
Marko Grobelnik
Knowledge & Data Engineering Group, University of Kassel, Wilhelmshöher Allee 73, D-34121, Kassel, Germany
Andreas Hotho
Jožef Stefan Institute, Jamova 39, 1000, Ljubljana, Slovenia
Dunja Mladenič
Dipartimento di Informatica, Università di Bari, Via E. Orabona, 4, 70125, Bari, Italia
Giovanni Semeraro
Faculty of Computer Science, Otto-von-Guericke-University Magdeburg, Germany
Myra Spiliopoulou
Research Center L3S, Appelstr. 9a, D-30167, Hannover, Germany
Gerd Stumme
Dept. Information and Knowledge Engineering, University of Economics, Prague, Winston Churchill Sq. 4, 130 67 Praha 3, Prague, Czech Republic
Vojtěch Svátek
Human Computer Studies Lab, University of Amsterdam, Kruislaan 419, 1089, Amsterdam, VA, The Netherlands
Maarten van Someren

About this paper

Cite this paper

Bast, H., Dupret, G., Majumdar, D., Piwowarski, B. (2006). Discovering a Term Taxonomy from Term Similarities Using Principal Component Analysis. In: Ackermann, M., et al. Semantics, Web and Mining. EWMF 2005, KDO 2005. Lecture Notes in Computer Science, vol 4289. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11908678_7

DOI: https://doi.org/10.1007/11908678_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-47697-9
Online ISBN: 978-3-540-47698-6
eBook Packages: Computer Science, Computer Science (R0)
