Abstract
Bilingual collocation extraction can improve the performance of monolingual extraction. This is especially true for the English–Arabic pair, where it helps overcome the difficulties of Arabic collocation extraction. In this paper, we present two novel approaches, one for extracting monolingual collocations and one for extracting bilingual collocations. The monolingual approach is hybrid, based on linguistic patterns and statistical measures. During statistical filtering, we propose combining vector-based measures with different association measures via a voting procedure. The bilingual approach capitalizes on several cues (position, frequency, cross-language correspondence between POS patterns, distribution, translation). It enhances monolingual collocation extraction by considering more than collocation equivalents that have a direct translation: it can also validate unconfirmed collocations because they translate confirmed ones. The results show, in particular, how the extraction of Arabic collocations can be improved by extracting English–Arabic ones. The precision of Arabic collocation extraction rose from about 86% to 96%.
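The voting idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the three association measures (PMI, Dice, t-score), their thresholds, the `min_votes` parameter, and all function names are assumptions chosen for illustration, and the vector-based measures are left out for brevity (each would simply be one more voter).

```python
import math

# All measures take: c_xy = co-occurrence count of the pair,
# c_x / c_y = counts of each word, n = corpus size in tokens.

def pmi(c_xy, c_x, c_y, n):
    """Pointwise mutual information of the word pair."""
    return math.log2((c_xy * n) / (c_x * c_y))

def dice(c_xy, c_x, c_y, n):
    """Dice coefficient of the word pair (n is unused)."""
    return 2 * c_xy / (c_x + c_y)

def t_score(c_xy, c_x, c_y, n):
    """t-score: observed minus expected co-occurrences, scaled."""
    expected = c_x * c_y / n
    return (c_xy - expected) / math.sqrt(c_xy)

# Hypothetical thresholds; in practice they would be tuned on data.
VOTERS = [(pmi, 3.0), (dice, 0.1), (t_score, 2.0)]

def count_votes(c_xy, c_x, c_y, n):
    """Number of measures whose score reaches its threshold."""
    return sum(1 for measure, threshold in VOTERS
               if measure(c_xy, c_x, c_y, n) >= threshold)

def is_collocation(c_xy, c_x, c_y, n, min_votes=2):
    """Accept the candidate if at least min_votes measures agree."""
    return count_votes(c_xy, c_x, c_y, n) >= min_votes
```

With these assumed settings, a candidate is retained when a majority of the measures accept it, so a single measure that over- or under-scores a pair cannot decide the outcome alone.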
Notes
In this section, we use MWE and collocation interchangeably, since some researchers do not differentiate between them.
Available at https://camel.abudhabi.nyu.edu/madamira/.
Available at https://nlp.stanford.edu/software/tagger.shtml.
Cite this article
Zribi, C.B.O. English–Arabic collocation extraction to enhance Arabic collocation identification. Knowl Inf Syst 62, 2439–2459 (2020). https://doi.org/10.1007/s10115-019-01428-0
DOI: https://doi.org/10.1007/s10115-019-01428-0