Abstract
Multilingual linguistic resources are usually constructed from parallel corpora, but because such corpora are available only for selected text domains and language pairs, other resources are being explored as well. This article explores the use of multilingual web-based encyclopaedias such as Wikipedia as comparable corpora for bilingual terminology extraction. We propose an approach that extracts terms and their translations from different types of Wikipedia link information and data. In a second step, linguistic-based information is used to re-rank and filter the extracted term candidates in the target language. Preliminary evaluations of the combined statistics-based and linguistic-based approach were carried out on different language pairs involving Japanese, French and English. These evaluations showed an improvement and good quality in the extracted term candidates, which can be used to build or enrich multilingual anthologies and dictionaries, or to feed a cross-language information retrieval system with expansion terms related to the source query.
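The two-stage idea in the abstract (collect translation candidates from link information, then re-rank and filter them) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not say which link types are used, so interlanguage links and redirects are chosen here as plausible examples, the data is a hypothetical in-memory toy set rather than real Wikipedia content, the evidence weights are arbitrary, and the function-word check is only a stand-in for the linguistic filter.

```python
from collections import Counter

# Hypothetical toy data standing in for Wikipedia link information.
# Interlanguage links: source-language article title -> target-language title.
INTERLANGUAGE_LINKS = {
    "ordinateur": "computer",
    "réseau informatique": "computer network",
}
# Redirects: alternative title -> canonical title, within one language.
SOURCE_REDIRECTS = {"ordi": "ordinateur"}
TARGET_REDIRECTS = {"computing machine": "computer", "the computer": "computer"}

# Assumed stand-in for a linguistic filter: reject candidates whose
# first word is a function word.
TARGET_FUNCTION_WORDS = {"the", "a", "an", "of"}


def candidate_translations(term):
    """Count the link evidence supporting each target-language candidate."""
    canonical = SOURCE_REDIRECTS.get(term, term)
    target = INTERLANGUAGE_LINKS.get(canonical)
    scores = Counter()
    if target is None:
        return scores
    scores[target] += 2  # direct interlanguage link (arbitrary weight)
    for alias, title in TARGET_REDIRECTS.items():
        if title == target:
            scores[alias] += 1  # redirect pointing at the linked article
    return scores


def passes_linguistic_filter(candidate):
    """Keep candidates that do not start with a function word."""
    return candidate.split()[0] not in TARGET_FUNCTION_WORDS


def extract(term):
    """Return filtered candidates, highest score first, ties by name."""
    scores = candidate_translations(term)
    kept = [(c, s) for c, s in scores.items() if passes_linguistic_filter(c)]
    return sorted(kept, key=lambda cs: (-cs[1], cs[0]))


print(extract("ordi"))
```

With the toy data, the source term "ordi" is resolved through its redirect, "computer" is ranked first because it has the direct link, and "the computer" is dropped by the filter.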
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sadat, F. (2012). Exploiting a Web-Based Encyclopedia as a Knowledge Base for the Extraction of Multilingual Terminology. In: Isahara, H., Kanzaki, K. (eds) Advances in Natural Language Processing. JapTAL 2012. Lecture Notes in Computer Science, vol 7614. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33983-7_9
DOI: https://doi.org/10.1007/978-3-642-33983-7_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33982-0
Online ISBN: 978-3-642-33983-7
eBook Packages: Computer Science, Computer Science (R0)