
English–Arabic collocation extraction to enhance Arabic collocation identification

  • Regular Paper
Knowledge and Information Systems

Abstract

Bilingual collocation extraction can improve the performance of monolingual extraction. This is especially true for the English–Arabic pair, where it helps overcome difficulties specific to Arabic collocation extraction. In this paper, we present two novel approaches, one for monolingual and one for bilingual collocation extraction. The monolingual approach is hybrid, based on linguistic patterns and statistical measures. During statistical filtering, we propose combining vector-based measures with different association measures via a voting procedure. The bilingual extraction capitalizes on several cues (position, frequency, cross-language correspondence between POS patterns, distribution, translation). It enhances monolingual collocation extraction by going beyond collocation equivalents with a direct translation: it can also validate unconfirmed collocations because they translate confirmed ones. The results show, in particular, how the extraction of Arabic collocations can be improved by extracting English–Arabic ones: the precision of Arabic collocation extraction rose from about 86% to 96%.
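
The voting idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify which measures, thresholds, or voting rule the paper uses, so everything below is an assumption made for illustration: three standard association measures (PMI, Dice, t-score), hypothetical thresholds, a simple majority rule, and adjacent word pairs as candidates. The vector-based measures and the linguistic pattern filter are left out; under this scheme, a vector-based score would simply be one more voter.

```python
import math
from collections import Counter

# Hypothetical thresholds, chosen only for illustration (not from the paper).
THRESHOLDS = {"pmi": 1.0, "dice": 0.3, "t_score": 1.0}


def association_scores(pair, pair_counts, word_counts, total_pairs):
    """Return PMI, Dice, and t-score for one candidate word pair."""
    w1, w2 = pair
    f12 = pair_counts[pair]                   # observed pair frequency
    f1, f2 = word_counts[w1], word_counts[w2]
    expected = f1 * f2 / total_pairs          # frequency expected under independence
    return {
        "pmi": math.log2(f12 / expected),
        "dice": 2 * f12 / (f1 + f2),
        "t_score": (f12 - expected) / math.sqrt(f12),
    }


def vote(scores, thresholds=THRESHOLDS, min_votes=2):
    """Accept a candidate when at least min_votes measures reach their threshold."""
    votes = sum(scores[name] >= thresholds[name] for name in thresholds)
    return votes >= min_votes


def extract_collocations(tokens):
    """Score every adjacent word pair and keep those that win the vote."""
    pairs = list(zip(tokens, tokens[1:]))
    pair_counts = Counter(pairs)
    word_counts = Counter(tokens)
    total = len(pairs)
    return {
        pair
        for pair in pair_counts
        if vote(association_scores(pair, pair_counts, word_counts, total))
    }
```

Calling `extract_collocations` on a tokenized text returns the set of adjacent pairs accepted by at least two of the three measures. On a real corpus, the candidates would first be restricted by POS patterns, and the thresholds would need tuning on annotated data.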


Notes

  1. In this section, we use the terms MWE and collocation interchangeably, since some researchers do not differentiate between them.

  2. Available at https://camel.abudhabi.nyu.edu/madamira/.

  3. Available at https://nlp.stanford.edu/software/tagger.shtml


Author information

Corresponding author

Correspondence to Chiraz Ben Othmane Zribi.


About this article


Cite this article

Zribi, C.B.O. English–Arabic collocation extraction to enhance Arabic collocation identification. Knowl Inf Syst 62, 2439–2459 (2020). https://doi.org/10.1007/s10115-019-01428-0


  • DOI: https://doi.org/10.1007/s10115-019-01428-0
