Creating Lexical Resources For Endangered Languages: January 2014
3 authors, including:
Khang Lam
Can Tho University
All content following this page was uploaded by Khang Lam on 12 May 2022.
Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 54–62,
Baltimore, Maryland, USA, 26 June 2014. © 2014 Association for Computational Linguistics
Section 4 and Section 5, we present approaches for creating new bilingual dictionaries and multilingual thesauruses, respectively. Experiments are described in Section 6. Section 7 concludes the paper.

2 Dictionaries vs. Thesauruses

A dictionary or a lexicon is a book (now, in electronic database formats as well) that consists of a list of entries sorted by the lexical unit. A lexical unit is a word or phrase being defined, also called the definiendum. A dictionary entry or a lexical entry simply contains a lexical unit and a definition (Landau, 1984). Given a lexical unit, the definition associated with it usually contains parts-of-speech (POS), pronunciations, meanings, example sentences showing the use of the source words and possibly additional information. A monolingual dictionary contains only one language, such as The Oxford English Dictionary3, while a bilingual dictionary consists of two languages, such as the English-Cheyenne dictionary4. A lexical entry in the bilingual dictionary contains a lexical unit in a source language and equivalent words or multiword expressions in the target language along with optional additional information. A bilingual dictionary may be unidirectional or bidirectional.

Thesauruses are specialized dictionaries that store synonyms and antonyms of selected words in a language. Thus, a thesaurus is a resource that groups words according to similarity (Kilgarriff, 2003). However, a thesaurus is different from a dictionary. Roget (1911) describes the organization of words in a thesaurus as “... not in alphabetical order as they are in a dictionary, but according to the ideas which they express.... The idea being given, to find the word, or words, by which that idea may be most fitly and aptly expressed. For this purpose, the words and phrases of the language are here classed, not according to their sound or their orthography, but strictly according to their signification”. In particular, a thesaurus contains a set of descriptors, an indexing language, a classification scheme or a system vocabulary (Soergel, 1974). A thesaurus also consists of relationships among descriptors. Each descriptor is a term, a notation or another string of symbols used to designate the concept. Examples of thesauruses are Roget’s International Thesaurus (Roget, 2008), the Open Thesaurus5 or the one at thesaurus.com.

We believe that the lexical resources we create are likely to help endangered languages in several ways. These can be educational tools for language learning within and outside the community of speakers of the language. The dictionaries and thesauruses we create can be of help in developing parsers for these languages, in addition to assisting machine or human translators to translate rich oral or possibly limited written traditions of these languages into other languages. We may also be able to construct mini pocket dictionaries for travelers and students.

3 http://www.oed.com/
4 http://cdkc.edu/cheyennedictionary/index-english/index.htm
5 http://www.openthesaurus.de/

3 Related work

Previous approaches to creating new bilingual dictionaries use intermediate dictionaries to find chains of words with the same meaning. Then, several approaches are used to mitigate the effect of ambiguity. These include consulting the dictionary in the reverse direction (Tanaka and Umemura, 1994) and computing ranking scores, variously called a semantic score (Bond and Ogura, 2008), an overlapping constraint score, a similarity score (Paik et al., 2004) and a converse mapping score (Shaw et al., 2013). Other techniques to handle the ambiguity problem merge results from several approaches: merging candidates from lexical triangulation (Gollins and Sanderson, 2001), creating a link structure among words (Ahn and Frampton, 2006) and building graphs connecting translations of words in several languages (Mausam et al., 2010). Researchers also merge information from several sources such as bilingual dictionaries and corpora (Otero and Campos, 2010) or a Wordnet (István and Shoichi, 2009; Lam and Kalita, 2013). Some researchers also extract bilingual dictionaries from corpora (Ljubešić and Fišer, 2011; Bouamor et al., 2013). The primary similarity among these methods is that either they work with languages that already possess several lexical resources or these approaches take advantage of related languages (that have some lexical resources) by using such languages as intermediary. The accuracies of bilingual dictionaries created from several available dictionaries and Wordnets are usually high. However, it is expensive to create such original
lexical resources and they do not always exist for many languages. For instance, we cannot find any Wordnet for chr or chy. In addition, these existing approaches can only generate one or just a few new bilingual dictionaries from at least two existing bilingual dictionaries.

Crouch (1990) clusters documents first using a complete link clustering algorithm and generates thesaurus classes or synonym lists based on user-supplied parameters such as a threshold similarity value, number of documents in a cluster, minimum document frequency and specification of a class formation method. Curran and Moens (2002a, 2002b) evaluate performance and efficiency of thesaurus extraction methods and also propose an approximation method that provides for better time complexity with little loss in performance accuracy. Ramírez et al. (2013) develop a multilingual Japanese-English-Spanish thesaurus using freely available resources: Wikipedia and Wordnet. They extract translation tuples from Wikipedia articles in these languages, disambiguate them by mapping to Wordnet senses, and extract a multilingual thesaurus with a total of 25,375 entries.

One thing to note about all these approaches is that they are resource hungry. For example, Lin (1998) works with a 64-million word English corpus to produce a high quality thesaurus with about 10,000 entries. Ramírez et al. (2013) have the entire Wikipedia at their disposal with millions of articles in three languages, although for experiments they use only about 13,000 articles in total. When we work with endangered or low-resource languages, we do not have the luxury of collecting such big corpora or accessing even a few thousand articles from Wikipedia or the entire Web. Many such languages have no or very limited Web presence. As a result, we have to work with whatever limited resources are available.

4 Creating new bilingual dictionaries

A dictionary Dict(S,T) between a source language S and a target language T has a list of entries. Each entry contains a word s in the source language S, part-of-speech (POS) and one or more translations in the target language T. We call such a translation t. Thus, a dictionary entry is of the form <s_i, POS, t_i1>, <s_i, POS, t_i2>, ....

This section examines approaches to create new bilingual dictionaries for endangered languages from just one dictionary Dict(S,I), where S is the endangered source language and I is an “intermediate helper” language. We require that the language I has an available Wordnet linked to the Princeton Wordnet (PWN) (Fellbaum, 1998). Many endangered languages have a bilingual dictionary, usually to or from a resource-rich language like French or English, which is the intermediate helper language in our experiments. We make an assumption that we can find only one unidirectional bilingual dictionary translating from a given endangered language to English.

4.1 Generating a reverse bilingual dictionary

Given a unidirectional dictionary Dict(S,I) or Dict(I,S), we reverse the direction of the entries to produce Dict(I,S) or Dict(S,I), respectively. We apply an approach called Direct Reversal with Similarity (DRwS), proposed by Lam and Kalita (2013), to create a reverse bilingual dictionary from an input dictionary.

The DRwS approach computes the distance between translations of entries by measuring their semantic similarity, the so-called simValue. The simValue between two phrases is calculated by comparing the similarity of the ExpansionSet for every word in one phrase with the ExpansionSet of every word in the other phrase. An ExpansionSet of a phrase is a union of the synset, synonym set, hyponym set, and/or hypernym set of every word in it. The synset, synonym, hyponym and hypernym sets of a word are obtained from PWN. The greater the simValue between two phrases, the more semantically similar the phrases are. According to Lam and Kalita (2013), if the simValue is equal to or greater than 0.9, the DRwS approach produces the “best” reverse dictionary.

For creating a reverse dictionary, we skip entries with a multiword expression in the translation. Based on our experiments, we have found that this approach is successful and hence, it may be an effective way to automatically create a new bilingual dictionary from an existing one. Figure 1 presents an example of generating entries for the reverse dictionary.

Figure 1: Example of creating entries for a reverse dictionary Dict(eng,chr) from Dict(chr,eng). The simValue between the words "ocean" and "sea" is 0.98, which is greater than the threshold of 0.90. Therefore, the words "ocean" and "sea" in English are hypothesized to have both meanings "amequohi" and "ustalanali" in Cherokee. We add these entries to Dict(eng,chr).

4.2 Building bilingual dictionaries to/from additional languages

We propose an approach using public Wordnets and MT to create new bilingual dictionaries Dict(S,T) from an input dictionary Dict(S,I). As previously mentioned, I is English in our experiments. Dict(S,T) translates a word in an endangered language S to a word or multiword expression in a target language T. In particular, we create bilingual dictionaries for an endangered language S from a given dictionary Dict(S,eng). Figure 2 presents the approach to create new bilingual dictionaries.

onyms of words belonging to SYN_eng in several non-English languages to generate SYN_L, L ∈ {fin, fra, jpn}. SYN_L in the language L is extracted from the publicly available Wordnet in language L linked to the PWN. Next, translation candidates are generated by translating all words in SYN_L, L ∈ {eng, fin, fra, jpn} to the target language T using an MT. A translation candidate is considered a correct translation of the source word in the target language if its rank is greater than a threshold. For each word s, we may have many candidates. A translation candidate with a higher rank is more likely to become a correct translation in the target language. The rank of a candidate is computed by dividing its occurrence count by the total number of candidates. Figure 3 shows an example of creating entries for Dict(chr,vie), where vie is Vietnamese, from Dict(chr,eng).
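The ranking rule in Section 4.2 (occurrence count divided by the total number of candidates, accepted when above a threshold) can be illustrated with a minimal Python sketch. The function names, the placeholder candidate strings and the threshold value are hypothetical and not taken from the paper; the sketch also assumes that repeated candidates count toward the total, which the text does not state explicitly.

```python
from collections import Counter


def rank_candidates(candidates):
    """Map each distinct candidate to occurrence count / total number of candidates."""
    total = len(candidates)  # assumption: duplicates count toward the total
    counts = Counter(candidates)
    return {cand: n / total for cand, n in counts.items()}


def accept_translations(candidates, threshold):
    """Keep candidates whose rank is strictly greater than the threshold."""
    ranks = rank_candidates(candidates)
    return sorted(cand for cand, rank in ranks.items() if rank > threshold)


# Hypothetical MT outputs for one source word, pooled over eng, fin, fra and jpn.
pooled = ["t1", "t1", "t2", "t1", "t3"]
print(rank_candidates(pooled))
print(accept_translations(pooled, threshold=0.5))
```

Pooling the candidates before counting means that a translation produced independently through several intermediate languages receives a higher rank than one produced through a single language.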
5 Constructing thesauruses

As previously mentioned, we want to generate a multilingual thesaurus THS composed of endangered and resource-rich languages. For example, we build the thesaurus encompassing an endangered language S and eng, fin, fra and jpn. Our thesaurus contains a list of entries. Every entry has a unique ID. Each entry is a 7-tuple: ID, SYN_S, SYN_eng, SYN_fin, SYN_fra, SYN_jpn and POS. Each SYN_L contains words that have the same sense in language L. All SYN_L, L ∈ {S, eng, fin, fra, jpn} with the same ID have the same sense.

This section presents the initial steps in constructing multilingual thesauruses using Wordnets and the bilingual dictionaries we create. The approach to create a multilingual thesaurus encompassing an endangered language and several resource-rich languages is presented in Figure 4 and Algorithm 1.

Figure 4: The approach to construct a multilingual thesaurus encompassing an endangered language S and resource-rich languages.

First, we extract SYN_L in resource-rich languages from Wordnets. To extract SYN_eng, SYN_fin, SYN_fra and SYN_jpn, we use PWN and Wordnets linked to the PWN provided by the Open Multilingual Wordnet6 project (Bond and Foster, 2013): FinnWordnet (FWN) (Lindén and Carlson, 2010), WOLF (WWN) (Sagot and Fišer, 2008) and JapaneseWordnet (JWN) (Isahara et al., 2008). For each offset-POS, we extract its corresponding synsets from PWN, FWN, WWN and JWN to generate SYN_eng, SYN_fin, SYN_fra and SYN_jpn (lines 7-10). The POS of the entry is the POS extracted from the offset-POS (line 5). Since these Wordnets are aligned, a specific offset-POS retrieves synsets that are equivalent sense-wise. Then, we translate all SYN_L to the given endangered language S using the bilingual dictionaries we created in the previous section (lines 11-14). Finally, we rank translation candidates and add the correct translations to SYN_S (lines 15-19). The rank of a candidate is computed by dividing its occurrence count by the total number of candidates. If a candidate has a rank value greater than a threshold, we accept it as a correct translation and add it to SYN_S.

Algorithm 1
Input: Endangered language S, PWN, FWN, WWN, JWN, Dict(eng,S), Dict(fin,S), Dict(fra,S) and Dict(jpn,S)
Output: thesaurus THS
1: ID := 0
2: for all offset-POSs in PWN do
3:   ID++
4:   candidates := φ
5:   POS := extract(offset-POS)
6:   SYN_S := φ
7:   SYN_eng := extract(offset-POS, PWN)
8:   SYN_fin := extract(offset-POS, FWN)
9:   SYN_fra := extract(offset-POS, WWN)
10:  SYN_jpn := extract(offset-POS, JWN)
11:  candidates += translate(SYN_eng, S)
12:  candidates += translate(SYN_fin, S)
13:  candidates += translate(SYN_fra, S)
14:  candidates += translate(SYN_jpn, S)
15:  for all candidate in candidates do
16:    if rank(candidate) > α then
17:      add(candidate, SYN_S)
18:    end if
19:  end for
20:  add ID, POS and all SYN_L into THS
21: end for

Figure 5 presents an example of creating an entry for the thesaurus. We generate entries for the multilingual thesaurus encompassing Cherokee, English, Finnish, French and Japanese. We extract words belonging to offset-POS "09426788-n" in PWN, FWN, WWN and JWN and add them into the corresponding SYN_L. The POS of this entry is "n", which is a "noun". Next, we use the bilingual dictionaries we cre-

6 http://compling.hss.ntu.edu.sg/omw/
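Algorithm 1 in Section 5 can be sketched in Python under simplifying assumptions. Here each Wordnet is modeled as a plain dict from offset-POS to a list of words and each Dict(L,S) as a dict from a word to its translations in S; these data structures, the function name, the placeholder words and the threshold value are all hypothetical and stand in for real Wordnet and dictionary access.

```python
from collections import Counter


def build_thesaurus(pwn, other_wordnets, dicts_to_s, alpha):
    """Sketch of Algorithm 1.

    pwn:            {offset_pos: [English words]}
    other_wordnets: {lang: {offset_pos: [words]}}, aligned with pwn
    dicts_to_s:     {lang: {word: [translations in S]}}
    alpha:          rank threshold
    """
    thesaurus = []
    for entry_id, offset_pos in enumerate(sorted(pwn), start=1):
        pos = offset_pos.rsplit("-", 1)[1]            # line 5
        syn = {"eng": pwn[offset_pos]}                # line 7
        for lang, wordnet in other_wordnets.items():  # lines 8-10
            syn[lang] = wordnet.get(offset_pos, [])
        candidates = []                               # lines 11-14
        for lang, words in syn.items():
            for word in words:
                candidates.extend(dicts_to_s.get(lang, {}).get(word, []))
        counts = Counter(candidates)                  # lines 15-19
        total = len(candidates)
        syn_s = sorted(c for c, n in counts.items() if n / total > alpha)
        entry = {"ID": entry_id, "POS": pos, "SYN_S": syn_s}
        entry.update({f"SYN_{lang}": words for lang, words in syn.items()})
        thesaurus.append(entry)                       # line 20
    return thesaurus


# Hypothetical toy data: one synset, two resource-rich languages.
pwn = {"00000001-n": ["e1", "e2"]}
fwn = {"00000001-n": ["f1"]}
dicts = {"eng": {"e1": ["s1"], "e2": ["s1", "s2"]}, "fin": {"f1": ["s1"]}}
print(build_thesaurus(pwn, {"fin": fwn}, dicts, alpha=0.5))
```

In the toy data, the placeholder translation "s1" is reached three times out of four candidates, so it is the only one whose rank exceeds the threshold of 0.5.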
6.1 Datasets used
We start with two bilingual dictionaries:
Dict(chr,eng)7 and Dict(chy,eng)8 that we
obtain from Web pages. These are unidirectional
bilingual dictionaries. The numbers of entries
in Dict(chr,eng) and Dict(chy,eng) are 3,199
and 28,097, respectively. For entries in these
input dictionaries without POS information, our
algorithm chooses the best POS of the English
word, which may lead to wrong translations. The
Microsoft Translator Java API9 is used as another
main resource. We were given free access to this
API. We could not obtain free access to the API
for the Google Translator.
The synonym lexicons are the synsets of PWN,
FWN, JWN and WWN. Table 1 provides some de-
tails of the Wordnets used.
these RR dictionaries. To create a reverse dictionary Dict(T,S), the DR approach takes each entry <s,POS,t> in the input dictionary Dict(S,T) and simply swaps the positions of s and t. The new entry <t,POS,s> is added into Dict(T,S). Figure 7 presents an example.

Dictionary   Entries     Dictionary   Entries
chr-arb       2,623      chr-cat       2,639
chr-cht       2,607      chr-dan       2,655
chr-deu       2,629      chr-mww       2,694
chr-ind       2,580      chr-zlm       2,633
chr-spa       2,607      chr-tha       2,645
chr-vie       2,618      chy-arb      10,604
chy-cat      10,748      chy-cht      10,538
chy-dan      10,654      chy-deu      10,708
chy-mww      10,790      chy-ind      10,434
chy-zlm      10,690      chy-spa      10,580
chy-tha      10,696      chy-vie      10,848
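The direct reversal (DR) step, in which each entry <s,POS,t> of Dict(S,T) becomes <t,POS,s> in Dict(T,S), can be sketched as follows. Entries are modeled as Python tuples; the function name and the placeholder entries are hypothetical, and dropping repeated reversed entries is an added assumption that the text does not mention.

```python
def direct_reversal(dict_st):
    """Swap source and target in each (s, pos, t) entry, keeping input order."""
    seen = set()
    dict_ts = []
    for s, pos, t in dict_st:
        entry = (t, pos, s)
        if entry not in seen:  # assumption: skip exact duplicates
            seen.add(entry)
            dict_ts.append(entry)
    return dict_ts


# Hypothetical input entries for Dict(S,T).
dict_st = [("s1", "n", "t1"), ("s2", "n", "t1"), ("s1", "n", "t1")]
print(direct_reversal(dict_st))
```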
run experiments with two endangered languages: Cherokee and Cheyenne. We have also experimented with two additional endangered languages from Northeast India: Dimasa and Karbi, spoken by about 115,000 and 492,000 people, respectively. We believe that our research has the potential to increase the number of lexical resources for languages which do not have many existing resources to begin with. We are in the process of creating reverse dictionaries from bilingual dictionaries we have already created. We are also in the process of creating a Website where all dictionaries and thesauruses we create will be available, along with a user-friendly interface to disseminate these resources to the wider public as well as to obtain feedback on individual entries. We will solicit feedback from communities that use the languages as mother-tongues. Our goal will be to use this feedback to improve the quality of the dictionaries and thesauruses. Some of the resources we created can be downloaded from http://cs.uccs.edu/~linclab/projects.html

References

Adam Kilgarriff. 2003. Thesauruses for natural language processing. In Proceedings of the Joint Conference on Natural Language Processing and Knowledge Engineering, pages 5–13, Beijing, China, October.

Benoit Sagot and Darja Fišer. 2008. Building a free French Wordnet from multilingual resources. In Proceedings of OntoLex, Marrakech, Morocco.

Carolyn J. Crouch. 1990. An approach to the automatic construction of global thesauri. Information Processing & Management, 26(5): 629–640.

Christiane Fellbaum. 1998. Wordnet: An Electronic Lexical Database. MIT Press, Cambridge, Massachusetts, USA.

Dagobert Soergel. 1974. Indexing languages and thesauri: construction and maintenance. Melville Publishing Company, Los Angeles, California.

Dhouha Bouamor, Nasredine Semmar and Pierre Zweigenbaum. 2013. Using Wordnet and Semantic Similarity for Bilingual Terminology Mining from Comparable Corpora. In Proceedings of the 6th Workshop on Building and Using Comparable Corpora, pages 16–23, Sofia, Bulgaria, August. Association for Computational Linguistics.

Dekang Lin. 1998. Automatic retrieval and clustering of similar words. In Proceedings of the 17th International Conference on Computational Linguistics (Volume 2), pages 768–774, Montreal, Quebec, Canada.

Francis Bond and Kentaro Ogura. 2008. Combining linguistic resources to create a machine-tractable Japanese-Malay dictionary. Language Resources and Evaluation, 42(2): 127–136.

Francis Bond and Ryan Foster. 2013. Linking and extending an open multilingual Wordnet. In Proceedings of 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), pages 1352–1362, Sofia, Bulgaria, August.

Hitoshi Isahara, Francis Bond, Kiyotaka Uchimoto, Masao Utiyama and Kyoko Kanzaki. 2008. Development of Japanese Wordnet. In Proceedings of 6th International Conference on Language Resources and Evaluation (LREC 2008), pages 2420–2423, Marrakech, Morocco, May.

James R. Curran and Marc Moens. 2002a. Scaling context space. In Proceedings of the 40th Annual Meeting of Association for Computational Linguistics (ACL 2002), pages 231–238, Philadelphia, USA, July.

James R. Curran and Marc Moens. 2002b. Improvements in automatic thesaurus extraction. In Proceedings of the Workshop on Unsupervised lexical acquisition (Volume 9), pages 59–66, Philadelphia, USA, July. Association for Computational Linguistics.

Jessica Ramírez, Masayuki Asahara and Yuji Matsumoto. 2013. Japanese-Spanish thesaurus construction using English as a pivot. arXiv preprint arXiv:1303.1232.

Jost Gippert, Nikolaus Himmelmann and Ulrike Mosel, eds. 2006. Essentials of Language Documentation. Vol. 178, Walter de Gruyter GmbH & Co. KG, Berlin, Germany.

Khang N. Lam and Jugal Kalita. 2013. Creating reverse bilingual dictionaries. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 524–528, Atlanta, USA, June.

Khang N. Lam, Feras A. Tarouti and Jugal Kalita. 2014. Automatically constructing Wordnet synsets. To appear at the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), Baltimore, USA, June.

Kisuh Ahn and Matthew Frampton. 2006. Automatic generation of translation dictionaries using intermediary languages. In Proceedings of the International Workshop on Cross-Language Knowledge Induction, pages 41–44, Trento, Italy, April. European Chapter of the Association for Computational Linguistics.

Krister Lindén and Lauri Carlson. 2010. FinnWordnet - WordNet på finska via översättning. LexicoNordica. Nordic Journal of Lexicography (Volume 17), pages 119–140.
Kumiko Tanaka and Kyoji Umemura. 1994. Construction of bilingual dictionary intermediated by a third language. In Proceedings of the 15th Conference on Computational linguistics (COLING 1994), Volume 1, pages 297–303, Kyoto, Japan, August. Association for Computational Linguistics.

Kyonghee Paik, Satoshi Shirai and Hiromi Nakaiwa. 2004. Automatic construction of a transfer dictionary considering directionality. In Proceedings of the Workshop on Multilingual Linguistic Resources, pages 31–38, Geneva, Switzerland, August. Association for Computational Linguistics.

Mausam, Stephen Soderland, Oren Etzioni, Daniel S. Weld, Kobi Reiter, Michael Skinner, Marcus Sammer and Jeff Bilmes. 2010. Panlingual lexical translation via probabilistic inference. Artificial Intelligence, 174(2010): 619–637.

Nikola Ljubešić and Darja Fišer. 2011. Bootstrapping bilingual lexicons from comparable corpora for closely related languages. In Proceedings of the 14th International Conference on Text, Speech and Dialogue (TSD 2011), pages 91–98, Plzeň, Czech Republic, September.

Pablo G. Otero and José R.P. Campos. 2010. Automatic generation of bilingual dictionaries using intermediate languages and comparable corpora. In Proceedings of the 11th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing’10), pages 473–483, Iași, Romania, March.

Peter M. Roget. 1911. Roget’s Thesaurus of English Words and Phrases.... Thomas Y. Crowell Company, New York, USA.

Peter M. Roget. 2008. Roget’s International Thesaurus, 3rd Edition. Oxford & IBH Publishing Company Pvt, New Delhi, India.

Ryan Shaw, Anindya Datta, Debra VanderMeer and Kaushik Datta. 2013. Building a scalable database-driven reverse dictionary. IEEE Transactions on Knowledge and Data Engineering, 25(3): 528–540.

Sidney I. Landau. 1984. Dictionaries: the art and craft of lexicography. Charles Scribner’s Sons, New York, USA.

Tim Gollins and Mark Sanderson. 2001. Improving cross language information retrieval with triangulated translation. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 90–95, New Orleans, Louisiana, USA, September.

Varga István and Yokoyama Shoichi. 2009. Bilingual dictionary generation for low-resourced language pairs. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (Volume 2), pages 862–870, Singapore, August. Association for Computational Linguistics.