Exploiting a Web-Based Encyclopedia as a Knowledge Base for the Extraction of Multilingual Terminology

Sadat, Fatiha

doi:10.1007/978-3-642-33983-7_9

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7614)

Included in the following conference series:

International Conference on NLP

Abstract

Multilingual linguistic resources are usually constructed from parallel corpora, but because such corpora are available only for selected text domains and language pairs, other resources are being explored as well. This article explores the use of multilingual web-based encyclopaedias such as Wikipedia as comparable corpora for bilingual terminology extraction. We propose an approach that extracts terms and their translations from different types of Wikipedia link information and data. The next step uses linguistics-based information to re-rank and filter the extracted term candidates in the target language. Preliminary evaluations of the combined statistics-based and linguistics-based approaches were carried out on different language pairs involving Japanese, French and English. These evaluations showed an improvement and good quality in the extracted term candidates, which can be used to build or enrich multilingual anthologies and dictionaries, or to feed a cross-language information retrieval system with expansion terms related to the source query.
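
The two-stage idea in the abstract (collect translation candidates from several kinds of Wikipedia links, then re-rank and filter them with linguistic information) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy evidence triples, the link-type names and weights, the part-of-speech lexicon, and the rule that a candidate is kept only when its part of speech matches that of the source term.

```python
from collections import Counter, defaultdict

# Hypothetical toy evidence: (source_term, target_candidate, link_type).
# Real evidence would be harvested from Wikipedia data; these triples are invented.
EVIDENCE = [
    ("ordinateur", "computer", "interlanguage"),
    ("ordinateur", "computer", "anchor"),
    ("ordinateur", "computing", "anchor"),
    ("ordinateur", "PC", "redirect"),
    ("ordinateur", "compute", "anchor"),
]

# Assumed weights per link type (not from the paper).
LINK_WEIGHTS = {"interlanguage": 3.0, "redirect": 2.0, "anchor": 1.0}

# Hypothetical part-of-speech lexicon standing in for a real tagger.
POS = {
    "ordinateur": "NOUN",
    "computer": "NOUN",
    "computing": "NOUN",
    "PC": "NOUN",
    "compute": "VERB",
}


def score_candidates(evidence, weights):
    """Statistics-based step: sum the link-type weights for each candidate."""
    scores = defaultdict(Counter)
    for source, target, link_type in evidence:
        scores[source][target] += weights.get(link_type, 0.0)
    return scores


def rerank(source, scores, pos):
    """Linguistics-based step: keep candidates sharing the source term's
    part of speech, then sort by descending score (ties broken by name)."""
    kept = [
        (target, score)
        for target, score in scores[source].items()
        if pos.get(target) == pos.get(source)
    ]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


scores = score_candidates(EVIDENCE, LINK_WEIGHTS)
ranked = rerank("ordinateur", scores, POS)
print(ranked)
```

In this toy run the verb "compute" is filtered out and the remaining noun candidates are ordered by their accumulated link weight. A real system would replace the lexicon with a tagger or morphological analyser for the target language and would estimate the weights from data.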

Author information

Authors and Affiliations

Université du Québec à Montréal, 201 av. President Kennedy, Montréal, QC, H3X 2Y3, Canada
Fatiha Sadat

Editor information

Editors and Affiliations

Information and Media Center, Toyohashi University of Technology, 1-1 Hibarigaoka, Tenpakucho, 441-8580, Toyohashi, Japan
Hitoshi Isahara & Kyoko Kanzaki

Cite this paper

Sadat, F. (2012). Exploiting a Web-Based Encyclopedia as a Knowledge Base for the Extraction of Multilingual Terminology. In: Isahara, H., Kanzaki, K. (eds) Advances in Natural Language Processing. JapTAL 2012. Lecture Notes in Computer Science, vol 7614. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33983-7_9

DOI: https://doi.org/10.1007/978-3-642-33983-7_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33982-0
Online ISBN: 978-3-642-33983-7
eBook Packages: Computer Science, Computer Science (R0)
