Abstract
Semantic interpretation of language requires extensive and rich lexical knowledge bases (LKBs). The Basque WordNet is an LKB based on WordNet and its multilingual counterparts EuroWordNet and the Multilingual Central Repository. This paper reviews the theoretical and practical aspects of the Basque WordNet lexical knowledge base, as well as the steps and methodology followed in its construction. Our methodology is based on the joint development of wordnets and annotated corpora. The Basque WordNet contains 32,456 synsets and 26,565 lemmas, and is complemented by a hand-tagged corpus comprising 59,968 annotations.
Notes
For the specific analysis and the conclusions drawn from it, see Pociello (2008).
The hypernymy/hyponymy relation is also referred to as the subset/superset relation.
All the expressions have been taken from WordNet 3.0 (http://wordnetweb.princeton.edu/perl/webwn), with some editing in synsets, literals and glosses due to space limitations.
We consulted Google Scholar in September 2010. Moreover, a WordNet bibliography with more than 400 papers is maintained at http://lit.csci.unt.edu/~wordnet/.
We use WordNet (upper case) for the original Princeton WordNet, while we use wordnet (lower case) for the rest.
Although top ontologies classify a limited number of synsets, the synsets below them can also inherit the classification.
Note that the texts included in EuSemCor were chosen independently from the English SemCor.
Given that Basque is an agglutinative language, it has a higher lemma/word ratio than English. Estimates from parallel corpora suggest that 300,000 words in Basque are comparable to 500,000 words in English.
Nouns in the corpus were ordered according to frequency, from most to least frequent. The editor follows this order to select words. That way it is possible to ensure that the most frequent nouns are properly edited and tagged.
These dictionaries were chosen for two reasons: first, we had access to them in electronic form, thanks to the close contacts between the IXA Group and the dictionary makers; second, they are widely used, both for specialised purposes (Euskalterm) and for general purposes.
The whole semantic class of the example has 22 hyponyms, but in the example only the direct hyponyms of the hyponym merrymaking have been given. The number of literals of the synsets has also been reduced.
Basque WordNet: http://ixa2.si.ehu.es/mcr/wei.html. Basque SemCor: http://sisx04.si.ehu.es:8080/EuSemCor.
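The frequency-based ordering of nouns described in the notes above can be illustrated with a minimal sketch. It assumes the corpus is available as (lemma, part-of-speech) pairs; the function name `nouns_by_frequency`, the `"NOUN"` tag, and the toy data are hypothetical and are not taken from the tools actually used to build the Basque WordNet.

```python
from collections import Counter


def nouns_by_frequency(tagged_tokens):
    """Return noun lemmas ordered from most to least frequent.

    tagged_tokens: iterable of (lemma, pos) pairs. The "NOUN" tag is an
    assumption of this sketch, not the tagset of the actual corpus.
    """
    counts = Counter(lemma for lemma, pos in tagged_tokens if pos == "NOUN")
    # Sort by descending count; break ties alphabetically so the
    # resulting order is deterministic.
    return sorted(counts, key=lambda lemma: (-counts[lemma], lemma))


# Toy input (hypothetical data, for illustration only).
sample = [
    ("etxe", "NOUN"), ("joan", "VERB"), ("etxe", "NOUN"),
    ("mendi", "NOUN"), ("etxe", "NOUN"), ("mendi", "NOUN"),
    ("ur", "NOUN"),
]

# The editor would work through this queue from the front.
queue = nouns_by_frequency(sample)
print(queue)
```

Working through such a queue from the front is one way to make sure that the most frequent nouns are edited and tagged first.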
References
Agirre, E., Aldezabal, I., Etxeberria, J., Izagirre, E., Mendizabal, K., Quintian, M., & Pociello, E. (2005). EuSemCor: Euskarako corpusa semantikoki etiketatzeko eskuliburua: Editatze- etiketatze- eta epaitze-lanak. Technical report, University of the Basque Country.
Agirre, E., Ansa, O., Arregi, X., Arriola, J., Díaz de Ilarraza, A., Pociello, E., & Uria, L. (2002). Methodological issues in the building of the Basque WordNet: Quantitative and qualitative analysis. In Proceedings of first international wordnet conference. Mysore, India.
Agirre, E., Ansa, O., Arregi, X., Artola, X., Zubillaga, X., Díaz de Ilarraza, A., & Lersundi, M. (2003). A conceptual schema for a Basque lexical-semantic framework. In Conference on computational lexicography and text research. Budapest, Hungary.
Agirre, E., & Lersundi, M. (2001). Extracción de relaciones léxico-semánticas a partir de palabras derivadas usando patrones de definición. In Proceedings of the annual SEPLN meeting. Jaén, Spain.
Agirre, E., & Martinez, D. (2002). Integrating selectional preferences in WordNet. In Proceedings of first international WordNet conference. Mysore, India.
Aldezabal, I. (2004). Aditz-azpikategorizazioaren azterketa sintaxi partzialetik sintaxi osorako bidean. 100 aditzen azterketa. Levin-en (1993) lana oinarri hartuta eta metodo informatikoak baliatuz. PhD thesis, University of the Basque Country.
Atserias, J., Villarejo, L., Rigau, G., Agirre, E., Carroll, J., Magnini, B., & Vossen, P. (2004). The MEANING multilingual central repository. In Proceedings of the 2nd global WordNet conference. Brno, Czech Republic.
Bentivogli, L., & Pianta, E. (2002). Extending WordNet with syntagmatic information. In Proceedings of second global WordNet conference. Brno, Czech Republic.
Calzolari, N., Fillmore, C., Grishman, R., Ide, N., Lenci, A., MacLeod, C., & Zampolli, A. (2002). Towards best practice for multiword expressions in computational lexicons. In Proceedings of the 3rd international conference on language resources and evaluation (LREC 2002). Las Palmas, Spain.
Carletta, J. (1996). Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2), 249–254.
Contreras, J. M., & Sueñer, A. (2004). Los procesos de la lexicalización. In E. Perez Gaztelu, I. Zabala, & L. Gràcia (Eds.), Las fronteras de la composición en lenguas románicas y en vasco (pp. 47–109). Deusto: University of Deusto.
Cowie, A. P., Mackin, R., & McCaig, I. R. (1990). Oxford dictionary of current Idiomatic English: Verbs with prepositions and particles, v2. London: Oxford University Press.
Cruse, A. (2000). Meaning in language: An introduction to semantics and pragmatics. London: Oxford University Press.
Elhuyar (1996). Elhuyar Hiztegia: Euskara-gaztelania. Donostia: Elhuyar Kultur Elkartea.
Elhuyar (1998). Elhuyar Hiztegi Txikia. Donostia: Elhuyar Kultur Elkartea.
Elhuyar (2000). Hiztegi Modernoa. Donostia: Elhuyar Kultur Elkartea.
Euskaltzaindia (2000). Hiztegi Batua. Donostia: Elkar.
Fellbaum, C. (1998). WordNet: An electronic lexical database. Cambridge (Massachusetts): MIT Press.
Fellbaum, C., Palmer, M., Dang, H. T., Delfs, L., & Wolf, S. (2001). Manual and automatic semantic annotation with WordNet. In Proceedings of the NAACL 2001 workshop on WordNet and other lexical resources. Pittsburgh.
Fernández, A., Saint-Dizier, P., Vázquez, G., Kamel, M., & Benamara, F. (2002). The Volem project: A framework for the construction of advanced multilingual lexicons. In Proceedings of language engineering conference (LEC’02). Hyderabad, India.
Fillmore, C. J., & Baker, C. F. (2001). FrameNet: Frame semantics meets the corpus. In Proceedings of WordNet and other lexical resources workshop. Pittsburgh.
Francopoulo, G., Bel, N., George, M., Calzolari, N., Monachini, M., Pet, M., & Soria, C. (2007). Lexical markup framework: ISO standard for semantic information in NLP lexicons. GLDV (Gesellschaft für linguistische Datenverarbeitung), Tübingen.
Gonzalo, J., Chugur, I., & Verdejo, F. (2000). Sense clusters for information retrieval: Evidence from SemCor and the EuroWordNet interlingual index. In Proceedings of the SIGLEX workshop on word senses and multilinguality, in conjunction with ACL-2000. Hong Kong, China.
Jackendoff, R. S. (1990). Semantic structures. Cambridge (Massachusetts): MIT Press.
Kingsbury, P., & Palmer, M. (2002). From TreeBank to PropBank. In Proceedings of the 3rd international conference on language resources and evaluation (LREC-2002). Las Palmas, Spain.
Lersundi, M. (2005). Ezagutza-base lexikala eraikitzeko Euskal Hiztegiko definizioen azterketa sintaktikosemantikoa. Hitzen arteko erlazio lexiko-semantikoak: Definizio-patroiak, eratorpena eta postposizioak. PhD thesis, University of the Basque Country.
Levin, B. (1993). English verb classes and alternations. A preliminary investigation. Chicago: The University of Chicago Press.
Lewandowski, T. (1992). Diccionario de Lingüística. Cátedra.
Miller, G. A. (1985). WordNet: A dictionary browser. In Proceedings of the first international conference on information in data. Waterloo.
Miller, G. A., Chodorow, M., Landes, S., Leacock, C., & Thomas, R. G. (1994). Using a semantic concordance for sense identification. In Proceedings of the ARPA human language technology workshop. San Francisco.
Niles, I., & Pease, A. (2001). Towards a standard upper ontology. In Proceedings of the 2nd international conference on formal ontology in information systems, FOIS 2001. Ogunquit, Maine.
Peters, W., & Peters, I. (2000). Automatic sense clustering in EuroWordNet. In Proceedings of LREC-2000. Athens, Greece.
Pociello, E. (2008). Euskararen ezagutza-base lexikala: Euskal WordNet. PhD thesis, University of the Basque Country.
Pociello, E., Gurrutxaga, A., Agirre, E., Aldezabal, I., & Rigau, G. (2008). WNTERM: Combining the Basque WordNet and a terminological dictionary. In Proceedings of the 6th international conference on language resources and evaluation (LREC). Marrakech.
Pustejovsky, J. (1995). The Generative Lexicon. Cambridge: MIT Press.
Rigau, G., Agirre, E., & Atserias, J. (2003). The MEANING project. In Proceedings of the XIX Congreso de la Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN). Alcalá de Henares (Madrid).
Sag, I. A., Baldwin, T., Bond, F., Copestake, A., & Flickinger, D. (2002). Multiword expressions: A pain in the neck for NLP. In Proceedings of the third international conference on intelligent text processing and computational linguistics (CICLING 2002). Mexico City, Mexico.
Sarasola, I. (1996). Euskal Hiztegia. Kutxa Gizarte eta Kultur Fundazioa, Donostia.
Stamou, S., Oflazer, K., Pala, K., Christodoulakis, D., Cristea, D., Tufis, D., Koeva, S., Totkov, G., Dutoit, D., & Grigoriadou, M. (2002). BalkaNet: A multilingual semantic network for the Balkan languages. In Proceedings of first international WordNet conference. Mysore, India.
Tufis, D., Cristea, D., & Stamou, S. (2004). BalkaNet: Aims, methods, results and perspectives. A general overview. Romanian Journal of Information Science and Technology, 7(1–2), 9–44.
UZEI (1987). Euskalterm. http://www1.euskadi.net/euskalterm/indice_c.htm. Accessed 17 March 2010.
Vossen, P. (1997). EuroWordNet: A multilingual database for information retrieval. In Proceedings of the DELOS workshop on cross-language information retrieval. Zurich.
Vossen, P. (1998). EuroWordNet: A multilingual database with lexical semantic networks. Dordrecht: Kluwer Academic Publishers.
Vossen, P. (1999). EuroWordNet general document. EuroWordNet (LE2-4003, LE4-8328), part a, final document deliverable D032D033/2D014.
Cite this article
Pociello, E., Agirre, E. & Aldezabal, I. Methodology and construction of the Basque WordNet. Lang Resources & Evaluation 45, 121–142 (2011). https://doi.org/10.1007/s10579-010-9131-y
DOI: https://doi.org/10.1007/s10579-010-9131-y