This paper describes the XLING system's participation in the SemEval-2013 Crosslingual Word Sense Disambiguation task. The XLING system introduces a novel approach that skips the sense disambiguation step by matching query sentences to sentences in a parallel corpus using topic models; it returns the word alignments as the translation for the target polysemous words. Although the topic-model-based matching underperformed, the matching approach showed potential when used with simple cosine-based surface similarity matching.
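As a rough illustration of the cosine-based surface matching mentioned above, the following minimal sketch picks the corpus sentence most similar to a query and returns the aligned translation of the target word. It is not the XLING implementation: the bag-of-words representation, the `translate_target` helper, and the toy corpus with its precomputed alignments are all assumptions made for the example.

```python
import math
from collections import Counter


def bag_of_words(sentence):
    """Lowercased whitespace tokens with counts (an assumed, very simple representation)."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def translate_target(query, target, corpus):
    """Return the aligned translation of `target` from the best-matching corpus sentence.

    `corpus` is a list of (source_sentence, alignment) pairs, where `alignment`
    maps source words to their aligned translations (assumed to be precomputed).
    """
    candidates = [entry for entry in corpus if target in entry[1]]
    if not candidates:
        return None
    q = bag_of_words(query)
    _, alignment = max(candidates, key=lambda e: cosine(q, bag_of_words(e[0])))
    return alignment[target]


# Hypothetical toy parallel corpus (English source, German alignments).
toy_corpus = [
    ("he sat on the bank of the river", {"bank": "Ufer"}),
    ("she deposited money at the bank", {"bank": "Bank"}),
]

print(translate_target("the bank of the river was muddy", "bank", toy_corpus))
```

No sense inventory is consulted at any point: the translation falls out of the sentence match, which is the sense in which the disambiguation step is skipped.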
In this paper we present the ongoing efforts to expand the depth and breadth of the Open Multilingual Wordnet coverage by introducing two new classes of non-referential concepts to wordnet hierarchies: interjections and numeral classifiers. The lexical semantic hierarchy pioneered by Princeton Wordnet has traditionally restricted its coverage to referential and contentful classes of words, such as nouns, verbs, adjectives and adverbs. Previous efforts have enriched wordnet resources by, for example, including pronouns, determiners and quantifiers within their hierarchies. Following similar efforts, and motivated by the ongoing semantic annotation of the NTU-Multilingual Corpus, we decided that the four traditional classes of words present in wordnets were too restrictive. Though non-referential, interjections and classifiers possess interesting semantic features that can be well captured by lexical resources like wordnets. In this paper, we will further ...
The Japanese language has absorbed large numbers of loanwords from many languages, in particular English. In addition to single loanwords, compound nouns, multiword expressions (MWEs), etc. constructed from loanwords are in use in very large quantities. In this paper we describe a system which has been developed to segment Japanese loanword MWEs and construct likely English translations. The system, which leverages the availability of large bilingual dictionaries of loanwords and English n-gram corpora, achieves high levels of accuracy in discriminating between single loanwords and MWEs, and in segmenting MWEs. It also generates useful translations of MWEs, and has the potential to be a major aid to lexicographers in this area.
This research is aimed at developing a hierarchical alternation-based lexical architecture for machine translation. The proposed architecture makes extensive use of information sharing in describing valency frames through derivational links from base frames, rather than as independent entities. This has advantages in descriptive efficiency, robustness and maintainability. The lexicon being developed is built up automatically from the Japanese component of an existing Japanese-English machine translation lexicon. The reconstruction process consists of analysing consistencies in selectional restrictions between valency frames, and postulating alternations where selectional restrictions are preserved on matching case slots; this was found to perform at 60.9% accuracy. All alternation candidates are incorporated into the final-version lexicon as derivational links, and expanded out at run time.
In this paper we compare the Oxford Lexico and Merriam-Webster dictionaries with Princeton WordNet with respect to the description of semantic (dis)similarity between polysemous and homonymous senses that can be inferred from them. WordNet lacks any explicit description of polysemy or homonymy, but as a network of linked senses it may be used to compute semantic distances between word senses. To compare WordNet with the dictionaries, we transformed sample entry microstructures of the latter into graphs and cross-linked them with the equivalent senses of the former. We found that the dictionaries are in high agreement with each other if one considers polysemy and homonymy together, and in moderate agreement if one focuses only on polysemy descriptions. Measuring the shortest path lengths on WordNet gave results comparable to those on the dictionaries in predicting semantic dissimilarity between polysemous senses, but was less successful at recognising homonymy.
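The shortest-path measure referred to above can be sketched with a plain breadth-first search over a graph of linked senses. This is only an illustrative sketch: the `shortest_path_length` function and the toy graph with its sense labels are invented for the example and are not WordNet data or the authors' code.

```python
from collections import deque


def shortest_path_length(graph, start, goal):
    """Number of edges on the shortest path between two senses, or None if unconnected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Hypothetical undirected sense graph (adjacency lists); labels are made up.
toy_graph = {
    "bank.financial": ["institution"],
    "bank.building": ["institution"],
    "bank.river": ["slope"],
    "institution": ["bank.financial", "bank.building", "entity"],
    "slope": ["bank.river", "entity"],
    "entity": ["institution", "slope"],
}

# Related (polysemous) senses sit closer together than unrelated (homonymous) ones.
print(shortest_path_length(toy_graph, "bank.financial", "bank.building"))
print(shortest_path_length(toy_graph, "bank.financial", "bank.river"))
```

In this toy setting a longer path is read as greater semantic dissimilarity; any threshold separating polysemy from homonymy would be a further assumption.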
In this paper we discuss an ongoing effort to enrich students’ learning by involving them in sense tagging. The main goal is to lead students to discover how we can represent meaning and where the limits of our current theories lie. A subsidiary goal is to create sense-tagged corpora and an accompanying linked lexicon (in our case wordnets). We present the results of tagging several texts and suggest some ways in which the tagging process could be improved. Two authors of this paper present their own experience as students. Overall, students reported that they found the tagging an enriching experience. The annotated corpora and changes to the wordnet are made available through the NTU multilingual corpus and associated wordnets (NTU-MC).
This paper describes our attempts to add Indonesian definitions to synsets in the Wordnet Bahasa (Nurril Hirfana Mohamed Noor et al., 2011; Bond et al., 2014), to extract semantic relations between lemmas and definitions for nouns and verbs, such as synonym, hyponym, hypernym and instance hypernym, and to generally improve Wordnet. The original, somewhat noisy, definitions for Indonesian came from the Asian Wordnet project (Riza et al., 2010). The basic method of extracting the relations is based on Bond et al. (2004). Before the relations could be extracted, the definitions were cleaned up and tokenized. We found that the definitions could not be completely cleaned up because of many misspellings and bad translations. However, we could identify four semantic relations in 57.10% of noun and verb definitions. For the remaining 42.90%, we propose to add 149 new Indonesian lemmas and make some improvements to Wordnet Bahasa and Wordnet in general.
This paper discusses the approach to multiword expressions being adopted in the LinGO English Resource Grammar (http://lingo.stanford.edu), a broad-scale bidirectional grammar of English in the HPSG framework. We discuss how the lexicon of multiword expressions is encoded in a database and describe the implications for building a reusable lexical resource.
Past approaches to developing an effective lexicon component in a grammar development environment have suffered from a number of usability and efficiency issues. We present a lexical database module currently in use by a number of grammar development projects. The module addresses issues which have caused problems in the past, and the power of a database architecture provides a number of practical advantages as well as a solid framework for future extension.