2009 10th International Conference on Document Analysis and Recognition
Enhanced Text Extraction from Arabic Degraded Document Images using
EM Algorithm
Wafa Boussellaa¹, Aymen Bougacha¹, Abderrazak Zahour², Haikal EL Abed³, Adel Alimi¹
¹ University of Sfax, REGIM, ENIS, Route Soukra, BPW, 3038, Sfax, Tunisia
² IUT, Université du Havre, Place Robert Schuman, 76610 Le Havre, France
³ Technical University Braunschweig, Institute for Communication Technology (IfN), Germany
Email: {wafa.boussellaa, adel.alimi, elabed}@ieee.org, Aymen.Bougacha@gmail.com, abderrazak.zahour@univ-lehavre.fr
978-0-7695-3725-2/09 $25.00 © 2009 IEEE. DOI 10.1109/ICDAR.2009.220

Abstract

This paper presents a new enhanced text extraction algorithm for degraded document images, based on probabilistic models. The observed document image is considered a mixture of Gaussian densities representing the foreground and background components of the document image. The EM algorithm is introduced to estimate and iteratively refine the parameters of the mixture of densities. The initial parameters of the EM algorithm are estimated by the k-means clustering method. After parameter estimation, the document image is partitioned into text and background classes by means of the maximum likelihood (ML) approach. The performance of the proposed approach is evaluated on a variety of degraded documents from the collections of the National Library of Tunisia.

1. Introduction

The automatic processing of degraded historical documents is a challenge in the document image analysis field, confronted with many difficulties due to storage conditions and the complexity of document content. For historical, degraded, and poor-quality documents, enhancement is not an easy task. The main purpose of an enhancement step for historical documents is to remove information coming from the background. Background artifacts can derive from many kinds of degradation, such as scan optical blur and noise, spots, underwriting, or overwriting. Most previous document image enhancement algorithms have been designed primarily for binarization. Binarization aims to extract text from distorted, degraded documents, and its related methods were proposed for processing grayscale documents and have not been extended to color documents.

In this paper, an enhanced text extraction method is proposed on the basis of maximum likelihood (ML) estimation for the segmentation problem. The difficult task lies in estimating the parameters of the likelihood functions and the number of segments. The expectation-maximization (EM) algorithm is therefore introduced to improve the parameter estimation. The initial estimates for the EM algorithm are given by the k-means clustering algorithm, to avoid the problem of random initial selection. Text and background segmentation is performed by the conventional ML method. The segmented image is used to produce a final colored restored document image.

The rest of this paper is organized as follows. Section 2 gives an overview of previous work on degraded document image enhancement. Sections 3 and 4 describe the proposed algorithm in detail. Experimental results are presented in Section 5. Conclusions and future work are given in the last section.

2. Related work

According to the literature, approaches that deal with document image enhancement and text extraction are based on binarization or foreground/background separation techniques. Binarization is performed either globally or locally. Global thresholding methods are not sufficient, since document images are usually degraded by shadows, non-uniform illumination, and low contrast. Local methods are shown to perform better, according to the recent exhaustive survey of image binarization methods presented in [15]. Two main approaches can be distinguished: local thresholding methods and clustering-based methods.

Local thresholding techniques estimate a different threshold for each pixel according to the gray-scale information of the neighboring pixels. Popular methods include Otsu's thresholding technique [13] and locally adaptive techniques [11, 14].
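As an illustration of the locally adaptive family [11], here is a minimal Niblack-style thresholding sketch in Python. The window size and the weight k below are illustrative choices, not values taken from this paper:

```python
import numpy as np

def box_mean(a, r):
    """Mean over a (2r+1) x (2r+1) window around each pixel,
    computed with a summed-area table over an edge-padded image."""
    n = 2 * r + 1
    ap = np.pad(a, r, mode="edge").astype(np.float64)
    s = np.zeros((ap.shape[0] + 1, ap.shape[1] + 1))
    s[1:, 1:] = ap.cumsum(0).cumsum(1)
    win = s[n:, n:] - s[:-n, n:] - s[n:, :-n] + s[:-n, :-n]
    return win / (n * n)

def niblack_binarize(gray, window=15, k=-0.2):
    """Niblack's rule: a pixel is text (True) when it is darker than
    the local threshold T = m + k*s, with m and s the local mean and
    standard deviation of its neighborhood."""
    r = window // 2
    m = box_mean(gray, r)
    var = np.maximum(box_mean(gray * gray, r) - m * m, 0.0)
    return gray < m + k * np.sqrt(var)
```

Because the threshold follows the local statistics, dark strokes survive even on a slowly varying, unevenly lit background, which a single global threshold would misclassify.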
Other adaptive methods specially designed for historical and distorted documents are based on adaptive threshold segmentation. Gatos et al. [8] present a binarization methodology based on background estimation used to segment the image; various pre- and post-processing steps are needed in this approach. Oh et al. [12] present an iterative algorithm based on water flow models and hierarchical thresholding. This method deals with low-contrast document images.

An iterative approach for segmenting degraded document images is described by Kavallieratou et al. [10]. It consists in obtaining an initial segmentation using global thresholding and then applying local thresholding to the areas detected as incorrectly segmented.

Other methods for historical document image enhancement, driven by the goal of improving the human readability of the documents, are clustering-based methods. These methods are dedicated to foreground/background separation of color document images using a classification approach. Garain et al. [7] have proposed an adaptive method for foreground/background separation in low-quality color document images. Connected component labeling is first applied to capture spatially connected pixels of similar color. Next, dominant background components are determined to divide the entire image into a number of grids representing local uniformity of background illumination.

Drira et al. [6] proposed a recursive method of unsupervised clustering. It classifies the pixels of the document image into three clusters (background, original text, and show-through effect). Thereafter, the show-through effect is eliminated and replaced by the color of the background.

Agam et al. [2] have described a novel approach based on probabilistic models (EM) for foreground and background separation. This algorithm also deals with low-contrast document images.
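The clustering-based separation described above can be illustrated with a small 1-D k-means over gray levels, using three clusters as in [6] (background, original text, show-through). This is a generic sketch, not the authors' implementation; the quantile-based initialization is an illustrative choice:

```python
import numpy as np

def kmeans_gray(levels, k=3, iters=25):
    """Cluster scalar gray levels into k groups with plain Lloyd
    iterations: assign each value to its nearest center, then move
    each center to the mean of its assigned values."""
    x = np.asarray(levels, dtype=np.float64)
    centers = np.quantile(x, np.linspace(0.1, 0.9, k))  # spread initial centers
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):          # keep empty clusters where they are
                centers[j] = x[labels == j].mean()
    return centers, labels
```

Pixels assigned to the show-through cluster can then be repainted with the background cluster's mean color, which is the restoration step Drira et al. describe.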
3. The EM Algorithm

The EM algorithm [5] is an iterative algorithm for calculating maximum-likelihood or maximum-a-posteriori estimates when the observations can be viewed as incomplete data. Each iteration of the algorithm consists of an expectation step followed by a maximization step.

The observed document image is considered a mixture model of two Gaussian densities representing the foreground (YF) and background (YB) components of the document image. The data of this model are described by a random vector X whose probability density function (PDF) is written as in Equation 1:

F(x, \theta) = \sum_{k=1}^{K} \pi_k f_k(x; \theta_k)    (1)

where:
K = 2 is the number of densities assumed a priori;
\theta_1 = (\mu_{YB}, \sigma_{YB}) are the mean and standard deviation vectors of the YB component;
\theta_2 = (\mu_{YF}, \sigma_{YF}) are the mean and standard deviation vectors of the YF component.

The PDFs f_1(x; \theta_1) and f_2(x; \theta_2) are used for maximum likelihood clustering. The estimation of the parameters \theta_1 and \theta_2 is performed using the EM algorithm, which needs a parameter initialization step. This step is the first one of our segmentation algorithm, detailed in the following section.

4. Proposed Method

The proposed approach is a new view of, and an improvement on, our methodology published in [3]. It belongs to our system PRAAD (Pre-processing and Analysis Tool for Arabic Ancient Documents) [4].

4.1. Pre-processing

Our approach processes both color and grayscale document images. To apply our approach to a color document image, we convert the image to the YIQ color space and operate on the Y luminance channel. This choice is justified by the fact that human vision is very sensitive to changes in luminosity. Moreover, the variation in light intensity caused by the uneven background of poorly preserved documents is captured in the Y channel. Because of the poor quality of the document images, and because pale colors and degradations coming from background artifacts affect the foreground contrast, we apply a stretching to the intensity values of the image histogram using a proportion value. The image YC is then produced with a proportion between 2% and 8%, which gives correct results. After this necessary pre-processing step, the contrasted image is processed by the segmentation algorithm detailed in the following section.
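A minimal sketch of this pre-processing step might look as follows: extract the YIQ luminance channel with the standard NTSC weights, then clip the histogram tails at a chosen proportion and stretch the rest to the full intensity range. The 2% proportion below is one value from the paper's stated 2-8% range:

```python
import numpy as np

def luminance_y(rgb):
    """Y channel of the YIQ color space (NTSC luminance weights)."""
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]

def percentile_stretch(y, proportion=2.0):
    """Clip `proportion` percent of the histogram at each tail and
    stretch the remaining intensities to [0, 255]."""
    lo, hi = np.percentile(y, [proportion, 100.0 - proportion])
    return np.clip((y - lo) / max(hi - lo, 1e-9), 0.0, 1.0) * 255.0
```

The output plays the role of the contrasted image YC that the segmentation algorithm consumes.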
4.2. The segmentation algorithm

Our proposed text and background segmentation algorithm operates in three steps, which are presented below.

4.2.1. Initial estimation

Initial estimates of \theta_1 and \theta_2 and their corresponding mean and standard deviation vectors (\mu_{YB}^{(0)}, \sigma_{YB}^{(0)}, \mu_{YF}^{(0)}, \sigma_{YF}^{(0)}) for the EM algorithm are calculated using the k-means clustering method presented in [3].

4.2.2. Iterative estimation by the EM algorithm

The EM algorithm is iteratively carried out with the initial estimates \theta^{(0)} and the intensity histogram H of the YC document image. The EM algorithm converges when the difference between the old and new estimates is less than a threshold \varepsilon, and the final estimate \theta^{(EM)} is obtained. The EM algorithm contributes to the segmentation algorithm by improving the parameters of the mixture of densities on the basis of the ML criterion. The EM algorithm is initialized as below.

Algorithm EM
Input:
  K = 2;
  \theta^{(0)} = (\pi_1^{(0)}, \pi_2^{(0)}, \mu_1^{(0)}, \sigma_1^{(0)}, \mu_2^{(0)}, \sigma_2^{(0)}): vectors estimated by the k-means algorithm, where (\mu_1^{(0)}, \sigma_1^{(0)}, \mu_2^{(0)}, \sigma_2^{(0)}) are the mean and standard deviation vectors and (\pi_1^{(0)}, \pi_2^{(0)}) = (1/2, 1/2);
  H: the histogram vector defined previously;
  \varepsilon: the threshold for algorithm convergence.
Output:
  \hat{\theta}^{(EM)}: a local maximum of the likelihood law.

t \leftarrow 0
\hat{\theta}^{(0)} = \theta^{(0)}: model initialization.
Repeat
  (Expectation step) Compute the posterior probabilities \hat{z}_{i,k}^{(t)}:
    \hat{z}_{i,k}^{(t)} = \frac{\hat{\pi}_k^{(t)} f(H_i; \hat{\theta}_k^{(t)})}{\sum_{l=1}^{K} \hat{\pi}_l^{(t)} f(H_i; \hat{\theta}_l^{(t)})}
  (Maximization step) Compute the estimates of \pi_k, \mu_k, \sigma_k maximizing Q(\theta; \hat{\theta}^{(t)}):
    \hat{\pi}_k^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \hat{z}_{i,k}^{(t)}: estimates the a priori probability \pi_k of the k-th density of the mixture;
    \hat{\mu}_k^{(t+1)} = \frac{\sum_{i=1}^{n} \hat{z}_{i,k}^{(t)} H_i}{\sum_{i=1}^{n} \hat{z}_{i,k}^{(t)}}: estimates the mean \mu_k of the k-th density of the mixture;
    \hat{\sigma}_k^{2\,(t+1)} = \frac{\sum_{i=1}^{n} \hat{z}_{i,k}^{(t)} (H_i - \hat{\mu}_k^{(t+1)})^2}{\sum_{i=1}^{n} \hat{z}_{i,k}^{(t)}}: estimates the variance \sigma_k^2 of the k-th density of the mixture;
  t \leftarrow t + 1
Until Q(\theta; \hat{\theta}^{(t)}) - Q(\theta; \hat{\theta}^{(t-1)}) < \varepsilon.

4.2.3. Maximum likelihood segmentation

The image segmentation is carried out by the conventional ML method using \theta^{(EM)}. The ML method estimates the probability that a pixel belongs to each class, text or background, and assigns the pixel to the class whose probability is maximal. We use the probability distributions of the Rayleigh law. According to this distribution, the likelihood is expressed as:

f_{k=1,2}(Y_C) = \frac{\pi Y_C}{2 \hat{\mu}_k^2} \exp\!\left(-\frac{\pi Y_C^2}{4 \hat{\mu}_k^2}\right)

The pixel (i, j) in YC is labeled V_k(i, j) according to the following rule:

V_k(i, j) = 1 if f_k(Y_C(i, j)) = \max(f_1(Y_C(i, j)), f_2(Y_C(i, j))), and V_k(i, j) = 0 otherwise.

Experimental work presented in our previous study [3] showed that, for degraded manuscripts, the Rayleigh distribution gives better results for text-background segmentation. Figure 1 and Figure 2 show segmentation results for the YC image in grayscale and in color space, respectively.

5. Experimental Results

As mentioned in the related work study, document image enhancement methods are driven by the goal of improving the human readability of document text. To evaluate the performance of our proposed method for enhanced text extraction, we carried out tests on 100 scanned images of old degraded handwritten documents provided by the National Library of Tunisia [1].
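The segmentation pipeline exercised in these experiments can be sketched end to end: EM refines a two-component mixture fitted to the intensity histogram, then each pixel is assigned to the class of maximal likelihood. This is a simplified sketch, not the authors' code: it uses Gaussian likelihoods throughout (the paper's final labeling uses the Rayleigh law), and the fixed initial means and standard deviations stand in for the k-means estimates:

```python
import numpy as np

def em_two_gaussians(hist, mu, sd, eps=1e-6, max_iter=200):
    """EM for a two-component Gaussian mixture fitted to a gray-level
    histogram, where hist[g] = number of pixels with intensity g."""
    g = np.arange(hist.size, dtype=np.float64)
    w = hist.astype(np.float64)
    pi = np.array([0.5, 0.5])                    # pi^(0) = (1/2, 1/2)
    mu = np.array(mu, dtype=np.float64)
    sd = np.array(sd, dtype=np.float64)
    prev = -np.inf
    for _ in range(max_iter):
        pdf = np.exp(-0.5 * ((g[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
        num = pi * pdf                           # shape (levels, 2)
        # stop when the log-likelihood gain drops below eps
        ll = (w * np.log(np.maximum(num.sum(axis=1), 1e-300))).sum()
        if ll - prev < eps:
            break
        prev = ll
        # E-step: posterior probabilities z[i, k] per gray level
        z = num / np.maximum(num.sum(axis=1, keepdims=True), 1e-300)
        # M-step: histogram-weighted updates of pi, mu, sigma
        nk = (w[:, None] * z).sum(axis=0)
        pi = nk / w.sum()
        mu = (w[:, None] * z * g[:, None]).sum(axis=0) / nk
        sd = np.maximum(
            np.sqrt((w[:, None] * z * (g[:, None] - mu) ** 2).sum(axis=0) / nk), 1e-3)
    return pi, mu, sd

def ml_label(image, pi, mu, sd):
    """Assign each pixel to the class (0 or 1) of maximal likelihood."""
    x = np.asarray(image, dtype=np.float64)[..., None]
    pdf = np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    return np.argmax(pi * pdf, axis=-1)
```

Working on the 256-bin histogram rather than on raw pixels keeps each EM iteration cheap regardless of image size, which matches the histogram-based formulation of Algorithm EM above.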
To assess the performance of our method, especially for difficult cases, three degradation types were selected for evaluation, and the images were visually inspected. The proposed method is evaluated using two sets of metrics. The first set is based on two selected and correlated binarization criteria: the misclassification error (ME) and the relative foreground area error (RAE) proposed by Sezgin and Sankur [15]. To be calculated, these criteria require the ground-truth binary image, which is obtained manually. The expected values of the above criteria vary within [0, 1]; in all cases, the measure closest to zero corresponds to the best binarization result. The analytical score values of the two criteria obtained for the three types of degraded document images are shown in Table 1.

The second set uses the precision and recall criteria presented in [16]. The two criteria are defined below, and the results are shown in Table 2.

Precision = number of correctly extracted foreground pixels / number of all foreground pixels extracted by the proposed method.

Recall = number of correctly extracted foreground pixels / total number of foreground pixels present in the document.

The number of foreground pixels present in the document is calculated using the ground-truth image. Precision reflects the performance of removing the degradation, and recall reflects the performance of extracting the foreground.

These evaluations include comparative results between the proposed method and several well-known binarization algorithms [8, 11, 12, 13, 14]. Moreover, an average performance score is computed for a global decision. The results in Table 1 and Table 2 show that our method achieves the best average value for the binarization criteria and for the precision and recall criteria, respectively.
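Under one common reading of these criteria (ME as the fraction of misclassified pixels, RAE as the foreground-area difference normalized by the larger area; the exact formulas in [15] should be treated as authoritative), all four scores can be computed from a pair of binary masks as:

```python
import numpy as np

def binarization_scores(result, truth):
    """result, truth: boolean masks with True = foreground (text).
    Returns (ME, RAE, precision, recall)."""
    tp = np.sum(result & truth)          # foreground found correctly
    fp = np.sum(result & ~truth)         # background wrongly kept
    fn = np.sum(~result & truth)         # foreground wrongly dropped
    me = (fp + fn) / truth.size          # misclassification error
    a_t, a_r = truth.sum(), result.sum() # ground-truth / extracted areas
    rae = abs(int(a_t) - int(a_r)) / max(a_t, a_r)  # relative foreground area error
    precision = tp / max(a_r, 1)
    recall = tp / max(a_t, 1)
    return me, rae, precision, recall
```

With this convention, false positives (left-over degradation) lower precision, while false negatives (lost strokes) lower recall, which matches the interpretation given in the text.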
Figure 1. ML text-background segmentation in the Y channel: (a) original degraded image, (b) background, (c) text, (d) bitonal image.
Figure 2. ML text-background segmentation in RGB color space: (a) original degraded image, (b) background, (c) text, (d) reconstructed image.
Table 1. Evaluation scores for the two binarization criteria, by document distortion type

                   Show-through effects   Localized spots    Uneven background   AVE
                   ME      RAE            ME      RAE        ME      RAE
Niblack            0.114   0.119          0.089   0.086      0.160   0.168       0.122
Sauvola            0.035   0.038          0.061   0.066      0.060   0.060       0.053
Otsu               0.005   0.005          0.024   0.023      0.024   0.027       0.018
Gatos et al.       0.047   0.035          0.032   0.034      0.042   0.041       0.038
Oh et al.          0.008   0.009          0.029   0.027      0.022   0.019       0.019
Proposed Method    0.003   0.003          0.023   0.000      0.014   0.009       0.009
Table 2. Evaluation accuracy for precision and recall, by document distortion type

                   Show-through effects    Localized spots        Uneven background      AVE
                   Precision  Recall       Precision  Recall      Precision  Recall     Precision  Recall
Niblack            53%        96%          66%        94%         41%        95%        53%        95%
Sauvola            99%        72%          99%        63%         96%        49%        98%        61%
Otsu               96%        100%         89%        99%         82%        100%       89%        99%
Gatos et al.       75%        93%          98%        81%         96%        65%        89%        79%
Oh et al.          100%       93%          97%        84%         97%        83%        98%        86%
Proposed Method    100%       97%          93%        93%         96%        90%        96%        93%
6. Conclusion and future work

The developed enhanced text extraction algorithm first applies a pre-processing step based on contrast adjustment of the document image. The image is then segmented into text and background components with the EM-based segmentation algorithm, which is composed of three steps: (1) EM initialization, (2) EM estimation, and (3) ML segmentation. According to the evaluation results, our method achieves the best average values compared with the other methods: a mean error of about 0.009 for enhanced text extraction, and accuracy rates of about 96% for precision and 93% for recall.

7. Acknowledgement

This research is carried out within the framework of the DAAD project "In the way of information society" and the research cooperation projects between Tunisia and Germany. The authors thank the National Library of Tunisia and the National Archives of Tunisia [1] for access to their large database of Arabic historical document images.

References

[1] National Library of Tunisia, http://www.bibliotheque.nat.tn.
[2] G. Agam, G. Bal, G. Frieder, O. Frieder, "Degraded document image enhancement", Proc. SPIE 6500, pp. C1-11, 2007.
[3] W. Boussellaa, A. Zahour, and A. Alimi, "A methodology for the separation of foreground/background in Arabic historical manuscripts using hybrid methods", Journal of Universal Computer Science, 14(2), pp. 284-298, 2008.
[4] W. Boussellaa, A. Zahour, B. Taconet, A. Alimi, and A. Benabdelhafid, "PRAAD: Preprocessing and analysis tool for Arabic ancient documents", Proc. 9th International Conference on Document Analysis and Recognition, vol. 2, pp. 1058-1062, 2007.
[5] A. P. Dempster, N. Laird, and D. B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm", J. Royal Statistical Soc., Series B (Methodological), vol. 39, no. 1, pp. 1-38, 1977.
[6] F. Drira, F. Le Bourgeois, H. Emptoz, "Restoring Ink Bleed-Through Degraded Document Images Using a Recursive Unsupervised Classification Technique", Document Analysis Systems VII, Springer Berlin/Heidelberg, 2006, pp. 38-49.
[7] U. Garain, T. Paquet, L. Heutte, "On Foreground-Background Separation in Low Quality Document Images", International Journal of Document Analysis, 8(1), pp. 47-63, 2006.
[8] B. Gatos, I. Pratikakis, S. J. Perantonis, "Adaptive degraded document image binarization", Pattern Recognition, 39(3), pp. 317-327, 2006.
[9] M. Junker, R. Hoch, and A. Dengel, "On the Evaluation of Document Analysis Components by Recall, Precision, and Accuracy", Proc. Fifth Int'l Conf. Document Analysis and Recognition, pp. 713-716, 1999.
[10] E. Kavallieratou, E. Stamatatos, "Improving the Quality of Degraded Document Images", Second International Workshop on Document Image Analysis for Libraries, 2006.
[11] W. Niblack, "An Introduction to Digital Image Processing", Prentice-Hall, Englewood Cliffs, NJ, pp. 115-116, 1986.
[12] H.-H. Oh, K.-T. Lim, and S.-I. Chien, "An improved binarization algorithm based on a water flow model for document image with inhomogeneous backgrounds", Pattern Recognition, 38(12), pp. 2612-2625, 2005.
[13] N. Otsu, "A threshold selection method from gray level histograms", IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, pp. 62-66, 1979.
[14] J. Sauvola, M. Pietikainen, "Adaptive document image binarization", Pattern Recognition, 33(2), pp. 225-236, 2000.
[15] M. Sezgin, B. Sankur, "Survey over image thresholding techniques and quantitative performance evaluation", J. Electron. Imaging, 13(1), pp. 146-165, 2004.
[16] C. L. Tan, R. Cao, P. Shen, "Restoration of Archival Documents Using a Wavelet Technique", IEEE Trans. Pattern Anal. Mach. Intell., 24(10), pp. 1399-1404, 2002.