Abstract
Far more natural language processing work has been done for Western languages than for their Eastern counterparts, particularly South Asian languages. Western languages are termed resource-rich: core linguistic resources such as corpora, WordNet, dictionaries, gazetteers and associated tools are readily available for them. Most South Asian languages, by contrast, are low-resource languages. Urdu, one of the widely spoken languages of the subcontinent, is an example, and because of this resource scarcity comparatively little work has been done on it. The core objective of this paper is to survey the linguistic resources that exist for Urdu language processing (ULP), to highlight the different tasks in ULP, and to discuss the available state-of-the-art techniques. Overall, the paper attempts to describe in detail the recent increase in interest and the progress made in ULP research. First, the available datasets for Urdu are discussed. The characteristics, orthography and morphology of Urdu, and resource sharing between Hindi and Urdu, are then described. Pre-processing activities such as stop-word removal, diacritics removal, normalization and stemming are illustrated. State-of-the-art research is reviewed for tasks such as tokenization, sentence boundary detection, part-of-speech tagging, named entity recognition, parsing and WordNet development. In addition, the impact of ULP on application areas such as information retrieval, classification and plagiarism detection is investigated. Finally, open issues and future directions for this new and dynamic area of research are provided. The goal of this paper is to organize existing ULP work so that it can serve as a platform for future ULP research.
Notes
The diacritics (called zer-e-izafat or hamza-e-izafat) are optional, and are not written in the example given.
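Because these izafat diacritics are optional, the same phrase can appear in text both with and without them, which is one reason diacritics removal is a common pre-processing step. The following is a minimal sketch of that idea, not the procedure of any particular work surveyed here. It assumes that dropping every Unicode combining mark (category `Mn`) is acceptable for matching purposes; the function name `strip_diacritics` and the example phrase are hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop combining marks (Unicode category Mn) without decomposing letters.

    Assumption: all Mn characters are treated as optional diacritics.
    No NFD normalization is applied, because decomposition would split
    letters such as U+0622 (alef with madda above) into a base letter
    plus a mark, and that mark would then be deleted as well.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# Hypothetical example: an izafat phrase written with the optional zer (U+0650)
# and the same phrase written without it.
with_zer = "\u0634\u06CC\u0631\u0650 \u067E\u0646\u062C\u0627\u0628"
without_zer = "\u0634\u06CC\u0631 \u067E\u0646\u062C\u0627\u0628"

# After stripping, both spellings compare equal.
same_after_stripping = strip_diacritics(with_zer) == strip_diacritics(without_zer)
```

Removing marks by category rather than by an explicit list keeps the sketch short, but it is a design choice: a real system might instead keep a curated list of Urdu diacritics so that marks carrying lexical distinctions are preserved.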
Acknowledgments
The work is supported by the Higher Education Commission (HEC), Islamabad, Pakistan.
Cite this article
Daud, A., Khan, W. & Che, D. Urdu language processing: a survey. Artif Intell Rev 47, 279–311 (2017). https://doi.org/10.1007/s10462-016-9482-x
DOI: https://doi.org/10.1007/s10462-016-9482-x