short-paper

Morphological Segmentation and Part-of-Speech Tagging for the Arabic Heritage

Published: 02 April 2018

Abstract

We annotate 60,000 words of Classical Arabic (CA), covering topics in philosophy, religion, literature, and law, with fine-grained, segment-based morphological descriptions. We use these annotations to build a morphological segmenter and a part-of-speech (POS) tagger for CA. Using character-level classification with features from the word and its lexical context, the segmenter achieves a word accuracy of 96.8%; its main source of error is a high rate of out-of-vocabulary words. A token-based POS tagger achieves an accuracy of 96.22% overall and 97.72% on known tokens, despite the small size of the corpus. An error analysis shows that most tagging errors result from segmentation errors and that quality improves as more data is added. The morphological segmenter and tagger have a wide range of potential applications in processing CA, a low-resource variety of the language.
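
As an illustration only, the following sketch shows one common way to cast segmentation as character-level classification and to compute the kind of word-level and known-token accuracies reported above. The abstract does not specify the label set, features, or classifier, so the B/I labelling scheme, the `+` separator, the function names, and the Buckwalter-style example words are all assumptions rather than the authors' implementation.

```python
from typing import Dict, List, Optional, Set, Tuple


def to_char_labels(segmented: str, sep: str = "+") -> Tuple[str, List[str]]:
    """Convert a segmented word into (surface form, per-character labels).

    Assumed scheme: 'B' marks the first character of a segment,
    'I' marks every other character.
    """
    chars: List[str] = []
    labels: List[str] = []
    for segment in segmented.split(sep):
        for i, ch in enumerate(segment):
            chars.append(ch)
            labels.append("B" if i == 0 else "I")
    return "".join(chars), labels


def from_char_labels(word: str, labels: List[str], sep: str = "+") -> str:
    """Rebuild the segmented form from a word and its B/I labels."""
    out: List[str] = []
    for ch, label in zip(word, labels):
        if label == "B" and out:
            out.append(sep)  # a new segment starts here
        out.append(ch)
    return "".join(out)


def word_accuracy(gold: List[str], predicted: List[str]) -> float:
    """Fraction of words whose whole segmentation matches the gold one."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def split_accuracy(
    words: List[str],
    gold: List[str],
    predicted: List[str],
    known_vocab: Set[str],
) -> Dict[str, Optional[float]]:
    """Accuracy on words seen in training ('known') versus unseen ones."""
    hits = {"known": 0, "unknown": 0}
    totals = {"known": 0, "unknown": 0}
    for w, g, p in zip(words, gold, predicted):
        key = "known" if w in known_vocab else "unknown"
        totals[key] += 1
        hits[key] += int(g == p)
    # None signals that a group is empty, so no accuracy is defined for it.
    return {k: (hits[k] / totals[k] if totals[k] else None) for k in totals}


# Hypothetical toy data in Buckwalter-style transliteration.
words = ["wktAbh", "ktAb", "qAl"]
gold = ["w+ktAb+h", "ktAb", "qAl"]
predicted = ["w+ktAb+h", "k+tAb", "qAl"]

print(to_char_labels("w+ktAb+h"))
print(word_accuracy(gold, predicted))
print(split_accuracy(words, gold, predicted, {"wktAbh", "ktAb"}))
```

Scoring whole words rather than individual characters is the stricter choice: a single misplaced boundary makes the entire word count as wrong, which is consistent with reporting a word accuracy for the segmenter.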

      Published In

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 17, Issue 3
      September 2018
      196 pages
      ISSN:2375-4699
      EISSN:2375-4702
      DOI:10.1145/3184403
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 02 April 2018
      Accepted: 01 December 2017
      Revised: 01 December 2017
      Received: 01 June 2015
      Published in TALLIP Volume 17, Issue 3

      Author Tags

      1. Arabic
      2. heritage
      3. morphological analysis
      4. part-of-speech tagging
      5. segmentation

      Qualifiers

      • Short-paper
      • Research
      • Refereed

      Cited By

      • (2022) [Retracted] Analysis of the Impact of the Traditional Literature Environment Based on Big Data Technology on Overseas Literature. Journal of Environmental and Public Health 2022:1. DOI: 10.1155/2022/2834363. Online publication date: 30-Sep-2022
      • (2022) CNN Based Character Recognition and Classification in Tamil Palm Leaf Manuscripts. 2022 International Conference on Communication, Computing and Internet of Things (IC3IoT), 1-6. DOI: 10.1109/IC3IOT53935.2022.9767866. Online publication date: 10-Mar-2022
      • (2021) Entanglement assisted training algorithm for supervised quantum classifiers. Quantum Information Processing 20:8. DOI: 10.1007/s11128-021-03179-w. Online publication date: 1-Aug-2021
      • (2020) RETRACTED ARTICLE: Multimedia text classification algorithm using potential Dirichlet distribution in mobile cloud computing environment. Multimedia Tools and Applications 79:13-14, 9615-9627. DOI: 10.1007/s11042-019-08253-1. Online publication date: 1-Apr-2020
      • (2019) Exploring the Performance of Tagging for the Classical and the Modern Standard Arabic. Advances in Fuzzy Systems 2019. DOI: 10.1155/2019/6254649. Online publication date: 23-Jan-2019
      • (2019) Arabic-SOS. Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage, 27-32. DOI: 10.1145/3322905.3322927. Online publication date: 8-May-2019
