Creating Lexical Resources For Endangered Languages: January 2014

Creating Lexical Resources for Endangered Languages
Conference Paper · January 2014

DOI: 10.3115/v1/W14-2207

Creating Lexical Resources for Endangered Languages
Khang Nhut Lam, Feras Al Tarouti and Jugal Kalita

Computer Science department
University of Colorado
1420 Austin Bluffs Pkwy, Colorado Springs, CO 80918, USA
{klam2,faltarou,jkalita}@uccs.edu
Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 54–62,
Baltimore, Maryland, USA, 26 June 2014. © 2014 Association for Computational Linguistics

Abstract

This paper examines approaches to generate lexical resources for endangered languages. Our algorithms construct bilingual dictionaries and multilingual thesauruses using public Wordnets and a machine translator (MT). Since our work relies on only one bilingual dictionary between an endangered language and an “intermediate helper” language, it is applicable to languages that lack many existing resources.

1 Introduction

Languages around the world are becoming extinct at a record rate. The Ethnologue organization¹ reports 424 languages as nearly extinct and 203 languages as dormant, out of a total of 7,106 recorded languages. Many other languages are becoming endangered, a state which is likely to lead to their extinction without determined intervention. According to UNESCO, “a language is endangered when its speakers cease to use it, use it in fewer and fewer domains, use fewer of its registers and speaking styles, and/or stop passing it on to the next generation...”. In America, UNESCO reports 134 endangered languages, e.g., Arapaho, Cherokee, Cheyenne, Potawatomi and Ute.

¹ http://www.ethnologue.com/

One of the hallmarks of a living and thriving language is the existence and continued production of “printed” (now extended to online presence) resources such as books, magazines and educational materials in addition to oral traditions. There is some effort afoot to document, record and archive endangered languages. Documentation may involve creation of dictionaries, thesauruses, text and speech corpora. One possible way to resuscitate these languages is to make them more easily learnable for the younger generation. To learn languages and use them well, tools such as dictionaries and thesauruses are essential. Dictionaries are resources that empower the users and learners of a language. Dictionaries play a more substantial role than usual for endangered languages and are “an instrument of language maintenance” (Gippert et al., 2006). Thesauruses are resources that group words according to similarity (Kilgarriff, 2003). For speakers and students of an endangered language, multilingual thesauruses are also likely to be very helpful.

This study focuses on examining techniques that leverage existing resources for “resource-rich” languages to build lexical resources for low-resource languages, especially endangered languages. The only resource we need is a single available bilingual dictionary translating the given endangered language to English. First, we create a reverse dictionary from the input dictionary using the approach in (Lam and Kalita, 2013). Then, we generate additional bilingual dictionaries translating from the given endangered language to several additional languages. Finally, we discuss the first steps to constructing multilingual thesauruses encompassing endangered and resource-rich languages. To handle the word sense ambiguity problems, we exploit Wordnets in several languages. We experiment with two endangered languages: Cherokee and Cheyenne, and some resource-rich languages such as English, Finnish, French and Japanese². Cherokee is an Iroquoian language spoken by 16,000 Cherokee people in Oklahoma and North Carolina. Cheyenne is a Native American language spoken by 2,100 Cheyenne people in Montana and Oklahoma.

² ISO 639-3 codes for Cherokee, Cheyenne, English, Finnish, French and Japanese are chr, chy, eng, fin, fra and jpn, respectively.

The remainder of this paper is organized as follows. Dictionaries and thesauruses are introduced in Section 2. Section 3 discusses related work. In
Section 4 and Section 5, we present approaches for creating new bilingual dictionaries and multilingual thesauruses, respectively. Experiments are described in Section 6. Section 7 concludes the paper.

2 Dictionaries vs. Thesauruses

A dictionary or a lexicon is a book (now, in electronic database formats as well) that consists of a list of entries sorted by the lexical unit. A lexical unit is a word or phrase being defined, also called definiendum. A dictionary entry or a lexical entry simply contains a lexical unit and a definition (Landau, 1984). Given a lexical unit, the definition associated with it usually contains parts-of-speech (POS), pronunciations, meanings, example sentences showing the use of the source words and possibly additional information. A monolingual dictionary contains only one language such as The Oxford English Dictionary³ while a bilingual dictionary consists of two languages such as the English-Cheyenne dictionary⁴. A lexical entry in the bilingual dictionary contains a lexical unit in a source language and equivalent words or multiword expressions in the target language along with optional additional information. A bilingual dictionary may be unidirectional or bidirectional.

³ http://www.oed.com/
⁴ http://cdkc.edu/cheyennedictionary/index-english/index.htm

Thesauruses are specialized dictionaries that store synonyms and antonyms of selected words in a language. Thus, a thesaurus is a resource that groups words according to similarity (Kilgarriff, 2003). However, a thesaurus is different from a dictionary. (Roget, 1911) describes the organization of words in a thesaurus as “... not in alphabetical order as they are in a dictionary, but according to the ideas which they express.... The idea being given, to find the word, or words, by which that idea may be most fitly and aptly expressed. For this purpose, the words and phrases of the language are here classed, not according to their sound or their orthography, but strictly according to their signification”. Particularly, a thesaurus contains a set of descriptors, an indexing language, a classification scheme or a system vocabulary (Soergel, 1974). A thesaurus also consists of relationships among descriptors. Each descriptor is a term, a notation or another string of symbols used to designate the concept. Examples of thesauruses are Roget’s International Thesaurus (Roget, 2008), the Open Thesaurus⁵ or the one at thesaurus.com.

⁵ http://www.openthesaurus.de/

We believe that the lexical resources we create are likely to help endangered languages in several ways. These can be educational tools for language learning within and outside the community of speakers of the language. The dictionaries and thesauruses we create can be of help in developing parsers for these languages, in addition to assisting machine or human translators to translate rich oral or possibly limited written traditions of these languages into other languages. We may also be able to construct mini pocket dictionaries for travelers and students.

3 Related work

Previous approaches to create new bilingual dictionaries use intermediate dictionaries to find chains of words with the same meaning. Then, several approaches are used to mitigate the effect of ambiguity. These include consulting the dictionary in the reverse direction (Tanaka and Umemura, 1994) and computing ranking scores, variously called a semantic score (Bond and Ogura, 2008), an overlapping constraint score, a similarity score (Paik et al., 2004) and a converse mapping score (Shaw et al., 2013). Other techniques to handle the ambiguity problem merge results from several approaches: merging candidates from lexical triangulation (Gollins and Sanderson, 2001), creating a link structure among words (Ahn and Frampton, 2006) and building graphs connecting translations of words in several languages (Mausam et al., 2010). Researchers also merge information from several sources such as bilingual dictionaries and corpora (Otero and Campos, 2010) or a Wordnet (István and Shoichi, 2009) and (Lam and Kalita, 2013). Some researchers also extract bilingual dictionaries from corpora (Ljubešić and Fišer, 2011) and (Bouamor et al., 2013). The primary similarity among these methods is that either they work with languages that already possess several lexical resources or these approaches take advantage of related languages (that have some lexical resources) by using such languages as intermediary. The accuracies of bilingual dictionaries created from several available dictionaries and Wordnets are usually high. However, it is expensive to create such original
lexical resources and they do not always exist for many languages. For instance, we cannot find any Wordnet for chr or chy. In addition, these existing approaches can only generate one or just a few new bilingual dictionaries from at least two existing bilingual dictionaries.

(Crouch, 1990) clusters documents first using a complete link clustering algorithm and generates thesaurus classes or synonym lists based on user-supplied parameters such as a threshold similarity value, number of documents in a cluster, minimum document frequency and specification of a class formation method. (Curran and Moens, 2002a) and (Curran and Moens, 2002b) evaluate performance and efficiency of thesaurus extraction methods and also propose an approximation method that provides for better time complexity with little loss in performance accuracy. (Ramírez et al., 2013) develop a multilingual Japanese-English-Spanish thesaurus using freely available resources: Wikipedia and Wordnet. They extract translation tuples from Wikipedia articles in these languages, disambiguate them by mapping to Wordnet senses, and extract a multilingual thesaurus with a total of 25,375 entries.

One thing to note about all these approaches is that they are resource hungry. For example, (Lin, 1998) works with a 64-million word English corpus to produce a high quality thesaurus with about 10,000 entries. (Ramírez et al., 2013) has the entire Wikipedia at their disposal with millions of articles in three languages, although for experiments they use only about 13,000 articles in total. When we work with endangered or low-resource languages, we do not have the luxury of collecting such big corpora or accessing even a few thousand articles from Wikipedia or the entire Web. Many such languages have no or very limited Web presence. As a result, we have to work with whatever limited resources are available.

4 Creating new bilingual dictionaries

A dictionary Dict(S,T) between a source language S and a target language T has a list of entries. Each entry contains a word s in the source language S, part-of-speech (POS) and one or more translations in the target language T. We call such a translation t. Thus, a dictionary entry is of the form <s_i, POS, t_i1>, <s_i, POS, t_i2>, ....

This section examines approaches to create new bilingual dictionaries for endangered languages from just one dictionary Dict(S,I), where S is the endangered source language and I is an “intermediate helper” language. We require that the language I has an available Wordnet linked to the Princeton Wordnet (PWN) (Fellbaum, 1998). Many endangered languages have a bilingual dictionary, usually to or from a resource-rich language like French or English, which is the intermediate helper language in our experiments. We make an assumption that we can find only one unidirectional bilingual dictionary translating from a given endangered language to English.

4.1 Generating a reverse bilingual dictionary

Given a unidirectional dictionary Dict(S,I) or Dict(I,S), we reverse the direction of the entries to produce Dict(I,S) or Dict(S,I), respectively. We apply an approach called Direct Reversal with Similarity (DRwS), proposed in (Lam and Kalita, 2013), to create a reverse bilingual dictionary from an input dictionary.

The DRwS approach computes the distance between translations of entries by measuring their semantic similarity, the so-called simValue. The simValue between two phrases is calculated by comparing the similarity of the ExpansionSet for every word in one phrase with the ExpansionSet of every word in the other phrase. An ExpansionSet of a phrase is a union of the synset, synonym set, hyponym set, and/or hypernym set of every word in it. The synset, synonym, hyponym and hypernym sets of a word are obtained from PWN. The greater the simValue between two phrases, the more semantically similar these phrases are. According to (Lam and Kalita, 2013), if the simValue is equal to or greater than 0.9, the DRwS approach produces the “best” reverse dictionary.

For creating a reverse dictionary, we skip entries with multiword expressions in the translation. Based on our experiments, we have found that this approach is successful and hence, it may be an effective way to automatically create a new bilingual dictionary from an existing one. Figure 1 presents an example of generating entries for the reverse dictionary.

Figure 1: Example of creating entries for a reverse dictionary Dict(eng,chr) from Dict(chr,eng). The simValue between the words "ocean" and "sea" is 0.98, which is greater than the threshold of 0.90. Therefore, the words "ocean" and "sea" in English are hypothesized to have both meanings "amequohi" and "ustalanali" in Cherokee. We add these entries to Dict(eng,chr).

4.2 Building bilingual dictionaries to/from additional languages

We propose an approach using public Wordnets and MT to create new bilingual dictionaries Dict(S,T) from an input dictionary Dict(S,I). As previously mentioned, I is English in our experiments. Dict(S,T) translates a word in an endangered language S to a word or multiword expression in a target language T. In particular, we create bilingual dictionaries for an endangered language S from a given dictionary Dict(S,eng). Figure 2 presents the approach to create new bilingual dictionaries.

Figure 2: The approach for creating new bilingual dictionaries from intermediate Wordnets and an MT.

For each entry pair (s,e) in a given dictionary Dict(S,eng), we find all synonym words of the word e to create a list of synonym words in English: SYN_eng. SYN_eng of the word e is obtained from the PWN. Then, we find all synonyms of words belonging to SYN_eng in several non-English languages to generate SYN_L, L ∈ {fin, fra, jpn}. SYN_L in the language L is extracted from the publicly available Wordnet in language L linked to the PWN. Next, translation candidates are generated by translating all words in SYN_L, L ∈ {eng, fin, fra, jpn} to the target language T using an MT. A translation candidate is considered a correct translation of the source word in the target language if its rank is greater than a threshold. For each word s, we may have many candidates. A translation candidate with a higher rank is more likely to become a correct translation in the target language. The rank of a candidate is computed by dividing its occurrence count by the total number of candidates. Figure 3 shows an example of creating entries for Dict(chr,vie), where vie is Vietnamese, from Dict(chr,eng).

Figure 3: Example of generating new entries for Dict(chr,vie) from Dict(chr,eng). The word "ayvtseni" in chr is translated to "throat" in eng. We find all synonym words for "throat" in English to generate SYN_eng and all synonyms in fin, fra and jpn for all words in SYN_eng. Then, we translate all words in all SYN_L sets to vie and rank them. According to rank calculations, the best translations of "ayvtseni" in chr are the words "cổ họng" and "họng" in vie.
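The ranking rule of Section 4.2 (occurrence count divided by the total number of candidates, accepted when above a threshold) can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the candidate list is invented rather than real MT output, the total is taken to count repeated candidates, and the threshold of 0.2 is chosen only for the example since the paper does not state a value.

```python
from collections import Counter

def rank_candidates(candidates):
    """Map each candidate to its rank: occurrence count / total number of candidates.
    Assumption: the total counts every occurrence, including repeats."""
    total = len(candidates)
    if total == 0:
        return {}
    counts = Counter(candidates)
    return {word: count / total for word, count in counts.items()}

def select_translations(candidates, threshold):
    """Keep candidates whose rank is strictly greater than the threshold,
    ordered by descending rank and then alphabetically."""
    ranks = rank_candidates(candidates)
    accepted = [word for word, rank in ranks.items() if rank > threshold]
    return sorted(accepted, key=lambda word: (-ranks[word], word))

# Invented candidate list standing in for MT output of the eng, fin, fra and jpn
# synonyms of "throat" translated into Vietnamese (compare Figure 3).
candidates = ["cổ họng", "họng", "cổ họng", "họng", "cuống họng", "cổ"]

print(rank_candidates(candidates))
print(select_translations(candidates, threshold=0.2))
```

Counting occurrences across several intermediate languages is what lets agreement between independent translation paths raise a candidate's rank; a candidate produced by only one path stays below the threshold.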
5 Constructing thesauruses

As previously mentioned, we want to generate a multilingual thesaurus THS composed of endangered and resource-rich languages. For example, we build the thesaurus encompassing an endangered language S and eng, fin, fra and jpn. Our thesaurus contains a list of entries. Every entry has a unique ID. Each entry is a 7-tuple: ID, SYN_S, SYN_eng, SYN_fin, SYN_fra, SYN_jpn and POS. Each SYN_L contains words that have the same sense in language L. All SYN_L, L ∈ {S, eng, fin, fra, jpn} with the same ID have the same sense.

This section presents the initial steps in constructing multilingual thesauruses using Wordnets and the bilingual dictionaries we create. The approach to create a multilingual thesaurus encompassing an endangered language and several resource-rich languages is presented in Figure 4 and Algorithm 1.

Figure 4: The approach to construct a multilingual thesaurus encompassing an endangered language S and resource-rich languages.

First, we extract SYN_L in resource-rich languages from Wordnets. To extract SYN_eng, SYN_fin, SYN_fra and SYN_jpn, we use PWN and Wordnets linked to the PWN provided by the Open Multilingual Wordnet⁶ project (Bond and Foster, 2013): FinnWordnet (FWN) (Lindén, 2010), WOLF (WWN) (Sagot and Fišer, 2008) and JapaneseWordnet (JWN) (Isahara et al., 2008). For each Offset-POS, we extract its corresponding synsets from PWN, FWN, WWN and JWN to generate SYN_eng, SYN_fin, SYN_fra and SYN_jpn (lines 7-10). The POS of the entry is the POS extracted from the Offset-POS (line 5). Since these Wordnets are aligned, a specific offset-POS retrieves synsets that are equivalent sense-wise. Then, we translate all SYN_L sets to the given endangered language S using bilingual dictionaries we created in the previous section (lines 11-14). Finally, we rank translation candidates and add the correct translations to SYN_S (lines 15-19). The rank of a candidate is computed by dividing its occurrence count by the total number of candidates. If a candidate has a rank value greater than a threshold, we accept it as a correct translation and add it to SYN_S.

⁶ http://compling.hss.ntu.edu.sg/omw/

Algorithm 1
Input: Endangered language S, PWN, FWN, WWN, JWN, Dict(eng,S), Dict(fin,S), Dict(fra,S) and Dict(jpn,S)
Output: thesaurus THS
 1: ID := 0
 2: for all offset-POSs in PWN do
 3:   ID++
 4:   candidates := ∅
 5:   POS = extract(offset-POS)
 6:   SYN_S := ∅
 7:   SYN_eng = extract(offset-POS, PWN)
 8:   SYN_fin = extract(offset-POS, FWN)
 9:   SYN_fra = extract(offset-POS, WWN)
10:   SYN_jpn = extract(offset-POS, JWN)
11:   candidates += translate(SYN_eng, S)
12:   candidates += translate(SYN_fin, S)
13:   candidates += translate(SYN_fra, S)
14:   candidates += translate(SYN_jpn, S)
15:   for all candidate in candidates do
16:     if rank(candidate) > α then
17:       add(candidate, SYN_S)
18:     end if
19:   end for
20:   add ID, POS and all SYN_L into THS
21: end for

Figure 5 presents an example of creating an entry for the thesaurus. We generate entries for the multilingual thesaurus encompassing Cherokee, English, Finnish, French and Japanese. We extract words belonging to offset-POS "09426788-n" in PWN, FWN, WWN and JWN and add them into the corresponding SYN_L. The POS of this entry is "n", which is a "noun". Next, we use the bilingual dictionaries we created to translate all words in SYN_eng, SYN_fin, SYN_fra, SYN_jpn to the given endangered language, Cherokee, and rank them. According to the rank calculations, the best Cherokee translation is the word “ustalanali”. The new entry added to the multilingual thesaurus is presented in Figure 6.

Figure 5: Example of generating an entry in the multilingual thesaurus encompassing Cherokee, English, Finnish, French and Japanese.

Figure 6: An entry of the multilingual thesaurus encompassing Cherokee, English, Finnish, French and Japanese.

6 Experimental results

Ideally, evaluation should be performed by volunteers who are fluent in both source and destination languages. However, for evaluating created dictionaries and thesauruses, we could not recruit any individuals who are experts in two corresponding languages. We are in the process of finding volunteers who are fluent in both languages for some selected resources we create.

6.1 Datasets used

We start with two bilingual dictionaries: Dict(chr,eng)⁷ and Dict(chy,eng)⁸ that we obtain from Web pages. These are unidirectional bilingual dictionaries. The numbers of entries in Dict(chr,eng) and Dict(chy,eng) are 3,199 and 28,097, respectively. For entries in these input dictionaries without POS information, our algorithm chooses the best POS of the English word, which may lead to wrong translations. The Microsoft Translator Java API⁹ is used as another main resource. We were given free access to this API. We could not obtain free access to the API for the Google Translator.

⁷ http://www.manataka.org/page122.html
⁸ http://www.cdkc.edu/cheyennedictionary/index-english/index.htm
⁹ https://datamarket.azure.com/dataset/bing/microsofttranslator

The synonym lexicons are the synsets of PWN, FWN, JWN and WWN. Table 1 provides some details of the Wordnets used.

Wordnet   Synsets   Core
JWN        57,179    95%
FWN       116,763   100%
PWN       117,659   100%
WWN        59,091    92%

Table 1: The number of synsets in the Wordnets linked to PWN 3.0, obtained from the Open Multilingual Wordnet, along with the percentage of synsets covered from the semi-automatically compiled list of 5,000 "core" word senses in PWN. Note that synsets which are not linked to the PWN are not taken into account.

6.2 Creating reverse bilingual dictionaries

From Dict(chr,eng) and Dict(chy,eng), we create two reverse bilingual dictionaries Dict(eng,chr) with 3,538 entries and Dict(eng,chy) with 28,072 entries.

Next, we reverse the reverse dictionaries we produce to generate new reverse of the reverse (RR) dictionaries, then integrate the RR dictionaries with the input dictionaries to improve the sizes of dictionaries. During the process of generating new reverse dictionaries, we already computed the semantic similarity values among words to find words with the same meanings. We use a simple approach called the Direct Reversal (DR) approach in (Lam and Kalita, 2013) to create
these RR dictionaries. To create a reverse dictionary Dict(T,S), the DR approach takes each entry <s,POS,t> in the input dictionary Dict(S,T) and simply swaps the positions of s and t. The new entry <t,POS,s> is added into Dict(T,S). Figure 7 presents an example.

Figure 7: Given a dictionary Dict(chy,eng), we create a new Dict(eng,chy) using the DRwS approach of (Lam and Kalita, 2013). Then, we create a new Dict(chy,eng) using the DR approach from the created dictionary Dict(eng,chy). Finally, we integrate the generated dictionary Dict(chy,eng) with the input dictionary Dict(chy,eng) to create a new dictionary Dict(chy,eng) with a greater number of entries.

The numbers of entries in the integrated dictionaries Dict(chr,eng) and Dict(chy,eng) are 3,618 and 47,529, respectively. Thus, the numbers of entries in the original dictionaries have "magically" increased by 13.1% and 69.21%, respectively.

6.3 Creating additional bilingual dictionaries

We can create dictionaries from chr or chy to any non-eng language supported by the Microsoft Translator, e.g., Arabic (arb), Chinese (cht), Catalan (cat), Danish (dan), German (deu), Hmong Daw (mww), Indonesian (ind), Malay (zlm), Thai (tha), Spanish (spa) and vie. Table 2 presents the number of entries in the dictionaries we create. These dictionaries contain translations only with the highest ranks for each word.

Although we have not evaluated entries in the particular dictionaries in Table 2, based on evaluation of dictionaries for non-endangered languages created using the same approach, we have confidence that these dictionaries are of acceptable, if not very good, quality.

Dictionary   Entries   Dictionary   Entries
chr-arb        2,623   chr-cat        2,639
chr-cht        2,607   chr-dan        2,655
chr-deu        2,629   chr-mww        2,694
chr-ind        2,580   chr-zlm        2,633
chr-spa        2,607   chr-tha        2,645
chr-vie        2,618   chy-arb       10,604
chy-cat       10,748   chy-cht       10,538
chy-dan       10,654   chy-deu       10,708
chy-mww       10,790   chy-ind       10,434
chy-zlm       10,690   chy-spa       10,580
chy-tha       10,696   chy-vie       10,848

Table 2: The number of entries in some dictionaries we create.

6.4 Creating multilingual thesauruses

We construct two multilingual thesauruses: THS_1(chr, eng, fin, fra, jpn) and THS_2(chy, eng, fin, fra, jpn). The numbers of entries in THS_1 and THS_2 are 5,073 and 10,046, respectively. These thesauruses we construct contain words with rank values above the average. A similar approach used to create Wordnet synsets (Lam et al., 2014) has produced excellent results. We believe that our thesauruses reported in this paper are of acceptable quality.

6.5 How to evaluate

Currently, we are not able to evaluate the dictionaries and thesauruses we create. In the future, we expect to evaluate our work using two methods. First, we will use the standard approach, which is human evaluation, to evaluate resources as previously mentioned. Second, we will try to find an additional bilingual dictionary translating from an endangered language S (viz., chr or chy) to another “resource-rich” non-English language (viz., fin or fra), then create a new dictionary translating from S to English using the approaches we have introduced. We plan to evaluate the new dictionary we create, say Dict(chr,eng), against the existing dictionary Dict(chr,eng).

7 Conclusion and future work

We examine approaches to create bilingual dictionaries and thesauruses for endangered languages from only one input dictionary, publicly available Wordnets and an MT. Taking advantage of available Wordnets linked to the PWN helps reduce ambiguities in dictionaries we create. We
run experiments with two endangered languages: Cherokee and Cheyenne. We have also experimented with two additional endangered languages from Northeast India: Dimasa and Karbi, spoken by about 115,000 and 492,000 people, respectively. We believe that our research has the potential to increase the number of lexical resources for languages which do not have many existing resources to begin with. We are in the process of creating reverse dictionaries from bilingual dictionaries we have already created. We are also in the process of creating a Website where all dictionaries and thesauruses we create will be available, along with a user friendly interface to disseminate these resources to the wider public as well as to obtain feedback on individual entries. We will solicit feedback from communities that use the languages as mother-tongues. Our goal will be to use this feedback to improve the quality of the dictionaries and thesauruses. Some of the resources we created can be downloaded from http://cs.uccs.edu/~linclab/projects.html

References

Adam Kilgarriff. 2003. Thesauruses for natural language processing. In Proceedings of the Joint Conference on Natural Language Processing and Knowledge Engineering, pages 5–13, Beijing, China, October.
Benoit Sagot and Darja Fišer. 2008. Building a free French Wordnet from multilingual resources. In Proceedings of OntoLex, Marrakech, Morocco.
Carolyn J. Crouch. 1990. An approach to the automatic construction of global thesauri. Information Processing & Management, 26(5): 629–640.
Christiane Fellbaum. 1998. Wordnet: An Electronic Lexical Database. MIT Press, Cambridge, Massachusetts, USA.
Dagobert Soergel. 1974. Indexing languages and thesauri: construction and maintenance. Melville Publishing Company, Los Angeles, California.
Dhouha Bouamor, Nasredine Semmar and Pierre Zweigenbaum. 2013. Using Wordnet and Semantic Similarity for Bilingual Terminology Mining from Comparable Corpora. In Proceedings of the 6th Workshop on Building and Using Comparable Corpora, pages 16–23, Sofia, Bulgaria, August. Association for Computational Linguistics.
Dekang Lin. 1998. Automatic retrieval and clustering of similar words. In Proceedings of the 17th International Conference on Computational Linguistics (Volume 2), pages 768–774, Montreal, Quebec, Canada.
Francis Bond and Kentaro Ogura. 2008. Combining linguistic resources to create a machine-tractable Japanese-Malay dictionary. Language Resources and Evaluation, 42(2): 127–136.
Francis Bond and Ryan Foster. 2013. Linking and extending an open multilingual Wordnet. In Proceedings of 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), pages 1352–1362, Sofia, Bulgaria, August.
Hitoshi Isahara, Francis Bond, Kiyotaka Uchimoto, Masao Utiyama and Kyoko Kanzaki. 2008. Development of Japanese Wordnet. In Proceedings of 6th International Conference on Language Resources and Evaluation (LREC 2008), pages 2420–2423, Marrakech, Morocco, May.
James R. Curran and Marc Moens. 2002a. Scaling context space. In Proceedings of the 40th Annual Meeting of Association for Computational Linguistics (ACL 2002), pages 231–238, Philadelphia, USA, July.
James R. Curran and Marc Moens. 2002b. Improvements in automatic thesaurus extraction. In Proceedings of the Workshop on Unsupervised lexical acquisition (Volume 9), pages 59–66, Philadelphia, USA, July. Association for Computational Linguistics.
Jessica Ramírez, Masayuki Asahara and Yuji Matsumoto. 2013. Japanese-Spanish thesaurus construction using English as a pivot. arXiv preprint arXiv:1303.1232.
Jost Gippert, Nikolaus Himmelmann and Ulrike Mosel, eds. 2006. Essentials of Language Documentation. Vol. 178, Walter de Gruyter GmbH & Co. KG, Berlin, Germany.
Khang N. Lam and Jugal Kalita. 2013. Creating reverse bilingual dictionaries. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 524–528, Atlanta, USA, June.
Khang N. Lam, Feras A. Tarouti and Jugal Kalita. 2014. Automatically constructing Wordnet synsets. To appear at the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), Baltimore, USA, June.
Kisuh Ahn and Matthew Frampton. 2006. Automatic generation of translation dictionaries using intermediary languages. In Proceedings of the International Workshop on Cross-Language Knowledge Induction, pages 41–44, Trento, Italy, April. European Chapter of the Association for Computational Linguistics.
Krister Lindén and Lauri Carlson. 2010. FinnWordnet - WordNet på finska via översättning. LexicoNordica. Nordic Journal of Lexicography (Volume 17), pages 119–140.
Kumiko Tanaka and Kyoji Umemura. 1994. Construction of bilingual dictionary intermediated by a third language. In Proceedings of the 15th Conference on Computational linguistics (COLING 1994), Volume 1, pages 297–303, Kyoto, Japan, August. Association for Computational Linguistics.
Kyonghee Paik, Satoshi Shirai and Hiromi Nakaiwa. 2004. Automatic construction of a transfer dictionary considering directionality. In Proceedings of the Workshop on Multilingual Linguistic Resources, pages 31–38, Geneva, Switzerland, August. Association for Computational Linguistics.
Mausam, Stephen Soderland, Oren Etzioni, Daniel S. Weld, Kobi Reiter, Michael Skinner, Marcus Sammer and Jeff Bilmes. 2010. Panlingual lexical translation via probabilistic inference. Artificial Intelligence, 174(2010): 619–637.
Nikola Ljubešić and Darja Fišer. 2011. Bootstrapping bilingual lexicons from comparable corpora for closely related languages. In Proceedings of the 14th International Conference on Text, Speech and Dialogue (TSD 2011), pages 91–98. Plzeň, Czech Republic, September.
Pablo G. Otero and José R.P. Campos. 2010. Automatic generation of bilingual dictionaries using intermediate languages and comparable corpora. In Proceedings of the 11th International Conference on Computational Linguistic and Intelligent Text Processing (CICLing’10), pages 473–483, Iaşi, Romania, March.
Peter M. Roget. 1911. Roget’s Thesaurus of English Words and Phrases.... Thomas Y. Crowell Company, New York, USA.
Peter M. Roget. 2008. Roget’s International Thesaurus, 3rd Edition. Oxford & IBH Publishing Company Pvt, New Delhi, India.
Ryan Shaw, Anindya Datta, Debra VanderMeer and Kaushik Datta. 2013. Building a scalable database-driven reverse dictionary. IEEE Transactions on Knowledge and Data Engineering, 25(3): 528–540.
Sidney I. Landau. 1984. Dictionaries: the art and craft of lexicography. Charles Scribner’s Sons, New York, USA.
Tim Gollins and Mark Sanderson. 2001. Improving cross language information retrieval with triangulated translation. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 90–95, New Orleans, Louisiana, USA, September.
Varga István and Yokoyama Shoichi. 2009. Bilingual dictionary generation for low-resourced language pairs. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (Volume 2), pages 862–870, Singapore, August. Association for Computational Linguistics.