Abstract
To date, there are no fully automated systems addressing the community’s need for fundamental language processing tools for Arabic text. In this chapter, we present a Support Vector Machine (SVM) based approach to automatically tokenize (segment off clitics), part-of-speech (POS) tag, and annotate Base Phrase Chunks (BPC) in Modern Standard Arabic (MSA) text. We adapt highly accurate tools that have been developed for English text and apply them to Arabic text. Using standard evaluation metrics, we report that the tokenizer (SVM-TOK) achieves an Fβ=1 score of 99.1, the tagger (SVM-POS) achieves an accuracy of 96.6%, and the chunker (SVM-BPC) yields an Fβ=1 score of 91.6.
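For readers unfamiliar with the F-measure quoted in the abstract, the following is a minimal sketch of the standard definition: the weighted harmonic mean of precision and recall, which with β = 1 weights the two equally. The function name `f_beta` and the input values are illustrative assumptions, not figures or code from the chapter, and the chapter appears to report its scores on a 0 to 100 scale rather than the 0 to 1 scale used here.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score, the weighted harmonic mean of precision and recall.

    With beta = 1 this is the usual balanced F1 score. Inputs are expected
    on a 0..1 scale; multiply the result by 100 for a percentage.
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator == 0:
        # Both precision and recall are zero: define the score as 0.
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator


# Hypothetical precision/recall values, chosen only to illustrate the formula.
example_score = f_beta(precision=0.9, recall=0.8)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a system cannot reach a high score by trading away recall for precision or the reverse.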
© 2007 Springer
About this chapter
Cite this chapter
Diab, M., Hacioglu, K., Jurafsky, D. (2007). Automatic Processing of Modern Standard Arabic Text. In: Soudi, A., Bosch, A.v., Neumann, G. (eds) Arabic Computational Morphology. Text, Speech and Language Technology, vol 38. Springer, Dordrecht. https://doi.org/10.1007/978-1-4020-6046-5_9
DOI: https://doi.org/10.1007/978-1-4020-6046-5_9
Publisher Name: Springer, Dordrecht
Print ISBN: 978-1-4020-6045-8
Online ISBN: 978-1-4020-6046-5
eBook Packages: Humanities, Social Sciences and Law; Social Sciences (R0)