English–Arabic collocation extraction to enhance Arabic collocation identification

Zribi, Chiraz Ben Othmane

doi:10.1007/s10115-019-01428-0

Regular Paper
Published: 21 December 2019

Volume 62, pages 2439–2459, (2020)

Knowledge and Information Systems

Chiraz Ben Othmane Zribi¹

Abstract

Bilingual collocation extraction can improve the performance of monolingual extraction. This is especially true for the English–Arabic pair, where it helps overcome the difficulties of Arabic collocation extraction. In this paper, we present two novel approaches for extracting monolingual and bilingual collocations. The monolingual approach is hybrid, based on linguistic patterns and statistical measures. During statistical filtering, we propose combining vector-based measures with different association measures via a voting procedure. The bilingual approach capitalizes on several cues: position, frequency, cross-language correspondence between POS patterns, distribution, and translation. It enhances monolingual collocation extraction by considering more than collocation equivalents with a direct translation: it can also validate unconfirmed collocations because they translate confirmed ones. The results show, in particular, that extracting English–Arabic collocations improves the extraction of Arabic collocations: the precision of Arabic collocation extraction rose from about 86% to 96%.
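
The voting idea in the statistical filtering step can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not name the association measures, the thresholds, or the voting rule, so the three measures used here (PMI, t-score, Dice), the `THRESHOLDS` values, the `min_votes` majority rule, and all function names are assumptions chosen for illustration. The vector-based measures and the linguistic pattern filter are left out.

```python
import math
from collections import Counter

# Hypothetical cut-offs, one per measure; real values would be tuned on data.
THRESHOLDS = {"pmi": 3.0, "t_score": 2.0, "dice": 0.1}


def association_scores(pair_count, w1_count, w2_count, total_pairs):
    """Score one candidate word pair with three common association measures."""
    # Co-occurrence count expected if the two words were independent.
    expected = w1_count * w2_count / total_pairs
    return {
        "pmi": math.log2(pair_count / expected),
        "t_score": (pair_count - expected) / math.sqrt(pair_count),
        "dice": 2 * pair_count / (w1_count + w2_count),
    }


def vote(scores, thresholds=THRESHOLDS, min_votes=2):
    """Accept a candidate when at least `min_votes` measures pass their cut-off."""
    votes = sum(scores[name] >= limit for name, limit in thresholds.items())
    return votes >= min_votes


def extract_collocations(bigrams, min_votes=2):
    """Return the candidate pairs from `bigrams` that win the vote."""
    pair_counts = Counter(bigrams)
    first_counts = Counter(w1 for w1, _ in bigrams)
    second_counts = Counter(w2 for _, w2 in bigrams)
    total = len(bigrams)
    accepted = []
    for (w1, w2), n in pair_counts.items():
        scores = association_scores(n, first_counts[w1], second_counts[w2], total)
        if vote(scores, min_votes=min_votes):
            accepted.append((w1, w2))
    return accepted
```

In this sketch, `bigrams` would be the list of candidate word pairs that survived a pattern filter, for example `extract_collocations(candidate_pairs)`. A majority rule lets a candidate survive one weak measure, which is one plausible motivation for combining measures rather than relying on a single one.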

Notes

In this section, we use the terms MWE and collocation interchangeably, since some researchers do not differentiate between them.
Available at https://camel.abudhabi.nyu.edu/madamira/.
Available at https://nlp.stanford.edu/software/tagger.shtml

Author information

Authors and Affiliations

RIADI-GDL Laboratory, ENSI-La Manouba, Manouba, Tunisia
Chiraz Ben Othmane Zribi

Corresponding author

Correspondence to Chiraz Ben Othmane Zribi.

About this article

Cite this article

Zribi, C.B.O. English–Arabic collocation extraction to enhance Arabic collocation identification. Knowl Inf Syst 62, 2439–2459 (2020). https://doi.org/10.1007/s10115-019-01428-0

Received: 15 March 2019
Revised: 17 November 2019
Accepted: 26 November 2019
Published: 21 December 2019
Issue Date: June 2020
DOI: https://doi.org/10.1007/s10115-019-01428-0
