Urdu ligature recognition using multi-level agglomerative hierarchical clustering

Cluster Computing

Abstract

Optical character recognition (OCR) systems hold great significance in human-machine interaction. OCR has been the subject of intensive research, especially for Latin, Chinese and Japanese scripts. Comparatively little work has been done on Urdu OCR, owing to the complexities and segmentation errors associated with its cursive script. This paper proposes an Urdu OCR system that aims at ligature-level recognition of Urdu text. This ligature-based recognition approach avoids the character-level segmentation problems associated with cursive scripts. A newly developed OCR algorithm is introduced that uses semi-supervised multi-level clustering to categorize the ligatures. Classification is performed using four machine learning techniques: decision trees, linear discriminant analysis, naive Bayes and k-nearest neighbor (K-NN). The implemented system achieves accuracies of 62%, 61%, 73% and 90% for decision trees, linear discriminant analysis, naive Bayes and K-NN, respectively.
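
To make the pipeline described in the abstract more concrete, the following is a minimal sketch of multi-level agglomerative clustering followed by K-NN classification. It is not the authors' implementation: the abstract does not state which features, linkage criterion, number of levels or cluster counts were used, so the two toy features, the average linkage, the two levels and the helper names (`agglomerate`, `multi_level`, `knn_predict`) are all assumptions chosen for illustration.

```python
from collections import Counter
from math import dist

def agglomerate(points, n_clusters):
    """Average-linkage agglomerative clustering (assumed linkage).
    Returns a list of clusters, each a list of indices into `points`."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Mean pairwise Euclidean distance between two clusters.
        return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them (j > i always).
        _, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merged = clusters.pop(j)
        clusters[i].extend(merged)
    return clusters

def multi_level(points, level_dims, level_k):
    """Cluster on the features in level_dims[0], then re-cluster each
    resulting group on level_dims[1], and so on."""
    groups = [list(range(len(points)))]
    for dims, k in zip(level_dims, level_k):
        next_groups = []
        for group in groups:
            sub = [tuple(points[i][d] for d in dims) for i in group]
            for cluster in agglomerate(sub, min(k, len(group))):
                next_groups.append([group[i] for i in cluster])
        groups = next_groups
    return groups

def knn_predict(train, labels, query, k=3):
    """Majority vote among the k nearest training samples."""
    nearest = sorted(range(len(train)), key=lambda i: dist(train[i], query))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# Hypothetical ligature feature vectors: (feature_0, feature_1).
points = [
    (1.0, 0.0), (1.1, 0.1),
    (1.0, 2.0), (1.1, 2.1),
    (5.0, 0.0), (5.1, 0.1),
    (5.0, 2.0), (5.1, 2.1),
]

# Level 1 splits on feature 0, level 2 splits each group on feature 1.
groups = multi_level(points, level_dims=[(0,), (1,)], level_k=[2, 2])

# Use the cluster index as the class label of every member, then
# classify an unseen sample with K-NN.
labels = [None] * len(points)
for cluster_id, group in enumerate(groups):
    for i in group:
        labels[i] = cluster_id

predicted = knn_predict(points, labels, query=(5.05, 1.9), k=3)
print(sorted(sorted(g) for g in groups), predicted)
```

In this sketch the clustering step only supplies class labels for the ligature samples; any of the four classifiers named in the abstract could then be trained on those labels, and K-NN is shown here because it is the one reported as most accurate.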

Author information

Correspondence to Naila Habib Khan.

Cite this article

Khan, N.H., Adnan, A. & Basar, S. Urdu ligature recognition using multi-level agglomerative hierarchical clustering. Cluster Comput 21, 503–514 (2018). https://doi.org/10.1007/s10586-017-0916-2

  • DOI: https://doi.org/10.1007/s10586-017-0916-2
