Consensus-based clustering for document image segmentation

Published: 01 December 2016

Abstract

Segmenting a document image is an important step in automatic document processing. In this paper, we propose a consensus-based clustering approach for document image segmentation. The foreground regions of a document image are first grouped into a set of primitive blocks, and a set of features is extracted from each block. Similarities among the blocks are computed separately for each feature using a hypothesis-test-based similarity measure, and the primitive blocks are then clustered based on the consensus of these per-feature similarities. This clustering is applied iteratively together with a classifier to label each primitive block. Experimental results show the effectiveness of the proposed method and, further, that the dependency of classification performance on the training data is significantly reduced.
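
The clustering step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the hypothesis test, the consensus rule, or the clustering algorithm, so everything concrete here is an assumption made for illustration: Welch's t statistic with a fixed critical value `t_crit` stands in for the test, a simple vote fraction `min_agree` stands in for the consensus, and connected components (union-find) stand in for the clustering. The function names and the feature names `stroke_width` and `height` are hypothetical, and the iterative coupling with a classifier is not modeled.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples (each needs at least 2 values)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:
        # Both samples are constant: equal means give 0, otherwise infinity.
        return 0.0 if mean(a) == mean(b) else float("inf")
    return (mean(a) - mean(b)) / se


def similar_on_feature(a, b, t_crit=2.0):
    """Assumed rule: two blocks are similar on a feature if |t| < t_crit."""
    return abs(welch_t(a, b)) < t_crit


def consensus_clusters(blocks, min_agree=0.5, t_crit=2.0):
    """Merge blocks when more than min_agree of their shared features agree.

    blocks: list of dicts mapping a feature name to a list of measurements.
    Returns one cluster label per block, numbered in order of first appearance.
    """
    parent = list(range(len(blocks)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(blocks)), 2):
        shared = blocks[i].keys() & blocks[j].keys()
        if not shared:
            continue
        votes = [similar_on_feature(blocks[i][f], blocks[j][f], t_crit)
                 for f in shared]
        if sum(votes) / len(votes) > min_agree:
            parent[find(i)] = find(j)

    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(blocks))]
```

With two text-like blocks and one much larger block, each described by a handful of stroke-width and height samples, the first two would be expected to share a label while the third stays separate. Voting per feature rather than pooling all features into one distance keeps a single noisy feature from deciding the outcome on its own, which is one plausible reading of "consensus" in the abstract.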

Cited By

  • (2017) Document Image Page Segmentation and Character Recognition as Semantic Segmentation. Proceedings of the 4th International Workshop on Historical Document Imaging and Processing, pp. 101-106. doi:10.1145/3151509.3151518. Online publication date: 10-Nov-2017

Published In

International Journal on Document Analysis and Recognition  Volume 19, Issue 4
December 2016
88 pages
ISSN:1433-2833
EISSN:1433-2825

Publisher

Springer-Verlag

Berlin, Heidelberg

Author Tags

  1. Clustering
  2. Document analysis
  3. Hypothesis testing
  4. Segmentation
  5. Stroke width

Qualifiers

  • Article
