Many language processing tasks are dependent on large databases of lexical semantic information, such as WordNet. These hand-built resources are tremendously time-consuming to create and may be lacking in coverage. They may be particularly inappropriate for text from a single domain, both because domain-specific terms are missing and because the lexicon contains many words or meanings which would be extremely rare in that domain. This thesis describes statistical techniques to automatically extract semantic information about words from text; specifically, given a large corpus of text and no additional sources of semantic information, we build a hierarchy of nouns appearing in the text. The hierarchy is in the form of an IS-A tree, where the nodes of the tree contain one or more nouns, and the ancestors of a node contain hypernyms of the nouns in that node. (An English word A is said to be a hypernym of a word B if native speakers of English accept the sentence “B is a (kind of) A.”) The techniques presented here could be used in the construction of updated or domain-specific semantic resources as needed. The methods described here provide a substantial improvement over previously published results; while we could previously produce a hierarchy whose internal nodes were judged to be correct hypernyms for 33% of the nouns beneath them, we can now achieve 56% on this measure. The thesis also includes a detailed discussion of a particular subproblem: determining which of a pair of nouns is more specific. We identify numerical measures which can be easily computed from a text corpus and which can answer this question with over 80% accuracy.
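The IS-A tree described in the abstract can be pictured with a small data structure. The following is a minimal sketch, not the thesis's implementation: the class name `IsANode`, the helpers `find` and `hypernyms_of`, and the toy nouns are all hypothetical and chosen only for illustration. It shows the one property the abstract states, namely that each node holds one or more nouns and that the nouns in a node's ancestors are hypernyms of the nouns in that node.

```python
from typing import Iterator, List, Optional


class IsANode:
    """One node of an IS-A tree; holds one or more nouns (hypothetical class)."""

    def __init__(self, nouns: List[str], parent: Optional["IsANode"] = None) -> None:
        self.nouns = list(nouns)
        self.parent = parent
        self.children: List["IsANode"] = []
        if parent is not None:
            parent.children.append(self)

    def ancestors(self) -> Iterator["IsANode"]:
        """Yield ancestors from the nearest parent up to the root."""
        node = self.parent
        while node is not None:
            yield node
            node = node.parent


def find(root: IsANode, noun: str) -> Optional[IsANode]:
    """Depth-first search for the node that contains `noun`."""
    stack = [root]
    while stack:
        node = stack.pop()
        if noun in node.nouns:
            return node
        stack.extend(node.children)
    return None


def hypernyms_of(root: IsANode, noun: str) -> List[str]:
    """Collect the nouns stored in every ancestor of the node holding `noun`."""
    node = find(root, noun)
    if node is None:
        return []
    return [h for ancestor in node.ancestors() for h in ancestor.nouns]


# Toy hierarchy (illustrative data, not taken from the thesis).
root = IsANode(["entity"])
animal = IsANode(["animal"], parent=root)
dogs = IsANode(["dog", "hound"], parent=animal)
vehicle = IsANode(["vehicle"], parent=root)

print(hypernyms_of(root, "dog"))
```

Under this reading, a statement such as "a dog is a kind of animal" corresponds to `"animal"` appearing in `hypernyms_of(root, "dog")`. How the thesis actually builds the tree from corpus statistics is not specified in the abstract and is not modeled here.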