2009 10th International Conference on Document Analysis and Recognition

A Laplacian Method for Video Text Detection

Trung Quy Phan, Palaiahnakote Shivakumara and Chew Lim Tan
School of Computing, National University of Singapore
{phanquyt, shiva, tancl}@comp.nus.edu.sg

Abstract

In this paper, we propose an efficient text detection method based on the Laplacian operator. The maximum gradient difference value is computed for each pixel in the Laplacian-filtered image. K-means is then used to classify all the pixels into two clusters: text and non-text. For each candidate text region, the corresponding region in the Sobel edge map of the input image undergoes projection profile analysis to determine the boundary of the text blocks. Finally, we employ empirical rules to eliminate false positives based on geometrical properties. Experimental results show that the proposed method is able to detect text of different fonts, contrasts and backgrounds. Moreover, it outperforms three existing methods in terms of detection and false positive rates.

1. Introduction

There is an increasing number of video databases on the Internet, and a reliable source of information for searching and retrieval is the text that appears in the videos. Video text consists of two types: graphic text and scene text. Graphic text is artificially added to the video during the editing process, while scene text appears naturally in the scenes captured by the camera. Although many methods have been proposed over the past years, text detection is still a challenging problem because videos often have low resolution and complex backgrounds, and text can be of different sizes, styles and alignments. In addition, scene text is usually affected by lighting conditions and perspective distortions [1-3].

Text detection methods can be classified into three approaches: connected component-based [4], edge-based [5-9] and texture-based [10-14]. The first approach does not work well for all video images because it assumes that text pixels in the same region have similar colors or grayscale intensities. The second approach requires text to have a reasonably high contrast to the background in order to detect the edges, so these methods often encounter problems with complex backgrounds and produce many false positives. Finally, the third approach considers text as a special texture and thus uses fast Fourier transform, discrete cosine transform, wavelet decomposition and Gabor filters for feature extraction. However, these methods require extensive training and are computationally expensive for large databases.

In this paper, we consider three existing methods [7, 8, 15] for comparative study. Liu et al. [7] extract edge features by using the Sobel operator. This method is able to determine the accurate boundary of each text block, but it is sensitive to the threshold values used for edge detection. Wong et al. [8] compute the maximum gradient difference values to identify candidate text regions. This method has a low false positive rate but uses many threshold values and heuristic rules, so it may only work well for specific datasets. Finally, Mariano et al. [15] perform clustering in the L*a*b* color space to locate uniform-colored text. Although it is good at detecting low contrast text and scene text, this method is extremely slow and produces many false positives.

We propose a text detection method which consists of three steps: text detection, boundary refinement and false positive elimination. In the first step, we identify candidate text regions by using the Laplacian operator. The second step uses projection profile analysis to determine the accurate boundary of each text block. Finally, false positives are removed based on geometrical properties. Experimental results show that the proposed method outperforms the above three methods in terms of detection and false positive rates.
2. Proposed method

2.1. Text detection

Text regions typically have a large number of discontinuities, e.g. transitions between text and background. Therefore, the input image is converted to grayscale and filtered by a 3 × 3 Laplacian mask to detect the discontinuities in four directions: horizontal, vertical, up-left and up-right (Figure 1).

Figure 1. The 3 × 3 Laplacian mask:
 1   1   1
 1  -8   1
 1   1   1

Because the mask produces two values for every edge, the Laplacian-filtered image contains both positive and negative values. The transitions between these values (the zero crossings) correspond to the transitions between text and background. In order to capture the relationship between positive and negative values, we use the maximum gradient difference (MGD), defined as the difference between the maximum and minimum values within a local 1 × N window [8]. The MGD value at pixel (i, j) is computed from the Laplacian-filtered image f as follows:

MGD(i, j) = max_t f(i, j − t) − min_t f(i, j − t),  where t ∈ [−(N − 1)/2, (N − 1)/2]   (1)

The MGD map is obtained by moving the window over the image. In Figure 2c, brighter colors represent larger MGD values. Text regions typically have larger MGD values than non-text regions because they have many positive and negative peaks (Figure 3).

Figure 2. The text detection step. (a) Input image. (b) Laplacian-filtered image. (c) Maximum gradient difference map. (d) Text cluster.

Figure 3. Sample profiles of text and non-text regions. Each graph shows the positive and negative values of the middle row of the corresponding Laplacian-filtered image (not shown here).

Therefore, we normalize the MGD map to the range [0, 1] and use K-means to classify all the pixels into two clusters, text and non-text, based on the Euclidean distance between MGD values. Let the two clusters returned by K-means be C1 (with cluster mean M1) and C2 (with cluster mean M2). Since the cluster order varies for different runs, we use the following rule to identify the text cluster: if M1 > M2, C1 is the text cluster; otherwise, C2 is the text cluster. This is because text regions are expected to have larger MGD values than non-text regions. At the end of this step, each connected component in the text cluster is a candidate text region (Figure 2d).
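To make the detection step concrete, the following is a minimal sketch of Section 2.1 in Python, assuming NumPy, SciPy and scikit-learn are available. The function name detect_text_cluster, the border handling and the library choices are our own; the paper does not specify an implementation, and the connected component analysis of the resulting text cluster is omitted.

```python
import numpy as np
from scipy.ndimage import convolve
from sklearn.cluster import KMeans

def detect_text_cluster(gray, N=5):
    """Sketch of Section 2.1: Laplacian filtering, MGD map, K-means.

    gray: 2-D grayscale image; N: width of the 1 x N MGD window.
    Returns a boolean mask of the text cluster.
    """
    # The 3 x 3 Laplacian mask of Figure 1 (all ones, -8 at the centre).
    mask = np.array([[1,  1, 1],
                     [1, -8, 1],
                     [1,  1, 1]], dtype=np.float64)
    f = convolve(gray.astype(np.float64), mask)

    # Equation (1): MGD = max - min of f within a 1 x N horizontal window.
    half = (N - 1) // 2
    padded = np.pad(f, ((0, 0), (half, half)), mode='edge')
    windows = np.stack([padded[:, t:t + f.shape[1]] for t in range(N)])
    mgd = windows.max(axis=0) - windows.min(axis=0)

    # Normalize the MGD map to [0, 1] and cluster the pixel values.
    mgd = (mgd - mgd.min()) / (mgd.max() - mgd.min() + 1e-9)
    km = KMeans(n_clusters=2, n_init=10).fit(mgd.reshape(-1, 1))

    # The cluster with the larger mean MGD is the text cluster.
    text_label = int(km.cluster_centers_.argmax())
    return (km.labels_ == text_label).reshape(mgd.shape)
```

Because each pixel is clustered on a single normalized MGD value, K-means here reduces to finding a one-dimensional split; the cluster with the larger centre is taken as text, which matches the M1 > M2 rule above.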
2.2. Boundary refinement

It is difficult to determine the boundary of each text block directly from the text cluster because of false positives and connected text lines (Figure 4b). Therefore, we compute the binary Sobel edge map SM of the input image (only for text regions) (Figure 4c). The horizontal projection profile is defined as follows:

HP(i) = Σ_j SM(i, j)   (2)

If HP(i) is greater than a certain threshold, row i is part of a text line; otherwise, it is part of the gap between different text lines. From this rule, we can determine the top row i1 and bottom row i2 of each text line. The vertical projection profile is then defined as follows:

VP(j) = Σ_{i = i1..i2} SM(i, j)   (3)

Similarly, if VP(j) is greater than a certain threshold, column j is part of a text line; otherwise, it is part of the gap between different words. Finally, different words on the same text line are merged if they are close to each other. By applying this step recursively, we can determine the accurate boundary of each text block, even when the text blocks are not well-aligned or when one candidate text region contains multiple text lines. At the end of this step, each detected block is a candidate text block (Figure 4d).

Figure 4. The boundary refinement step. (a) Input image. (b) Text cluster. (c) Sobel edge map. (d) Text blocks.
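A single pass of this profile analysis might look like the sketch below. The paper applies the step recursively and merges nearby words on the same line, both of which are omitted here; the helper runs and the explicit threshold arguments are illustrative, since the paper only speaks of "a certain threshold".

```python
import numpy as np

def split_text_blocks(SM, row_thresh, col_thresh):
    """One pass of Section 2.2 on a binary Sobel edge map SM.

    Returns candidate boxes (i1, i2, j1, j2) in row/column coordinates.
    """
    boxes = []
    # Equation (2): horizontal projection profile over all columns.
    HP = SM.sum(axis=1)
    for i1, i2 in runs(HP > row_thresh):       # rows of one text line
        # Equation (3): vertical projection profile over rows i1..i2.
        VP = SM[i1:i2 + 1].sum(axis=0)
        for j1, j2 in runs(VP > col_thresh):   # columns of one word
            boxes.append((i1, i2, j1, j2))
    return boxes

def runs(flags):
    """Yield (start, end) index pairs of each run of True values."""
    start = None
    for k, v in enumerate(flags):
        if v and start is None:
            start = k
        elif not v and start is not None:
            yield start, k - 1
            start = None
    if start is not None:
        yield start, len(flags) - 1
```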
2.3. False positive elimination

We eliminate false positives based on geometrical properties. Let W, H, AR, A and EA be the width, height, aspect ratio, area and edge area of a candidate text block B:

AR = W / H   (4)
A = W × H   (5)
EA = Σ_{(i, j) ∈ B} SM(i, j)   (6)

If AR < T1 or EA / A < T2, the candidate text block is considered as a false positive; otherwise, it is accepted as a text block. The first rule checks whether the aspect ratio is below a certain threshold. The second rule assumes that a text block has a high edge density due to the transitions between text and background.
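These two rules translate directly into code. The sketch below reuses the binary edge map SM and the (i1, i2, j1, j2) boxes from the boundary refinement sketch above; the default thresholds T1 = 0.5 and T2 = 0.1 are the values reported in Section 3.

```python
def is_text_block(SM, box, T1=0.5, T2=0.1):
    """Apply the Section 2.3 rules to one candidate text block."""
    i1, i2, j1, j2 = box
    W = j2 - j1 + 1                        # width
    H = i2 - i1 + 1                        # height
    AR = W / H                             # Equation (4): aspect ratio
    A = W * H                              # Equation (5): area
    EA = SM[i1:i2 + 1, j1:j2 + 1].sum()    # Equation (6): edge area
    # Reject thin blocks (AR < T1) and sparse blocks (EA / A < T2).
    return AR >= T1 and EA / A >= T2
```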
3. Experimental results

As there is no standard dataset available, we have selected 101 video images, extracted from news programmes, sports videos and movie clips, for our own dataset. The dataset contains both graphic text and scene text in different languages, e.g. English, Chinese and Korean. The image sizes range from 320 × 240 to 816 × 448. The parameter values are empirically determined: N = 5, T1 = 0.5 and T2 = 0.1.

For comparison purposes, we have implemented three existing methods [7, 8, 15]. Method [7], denoted as the edge-based method, extracts edge features by using the Sobel operator. Method [8], denoted as the gradient-based method, computes the MGD values to identify candidate text regions. Finally, method [15], denoted as the uniform-colored method, performs clustering in the L*a*b* color space to locate the text lines.

3.1. Sample results

Figure 5 shows some sample results of the three existing methods and the proposed method. Image (a) has two low contrast text blocks. The edge-based method fails to detect any text block because of the problem of fixing threshold values for edge detection. The gradient-based method detects the text blocks but with missing characters and inaccurate boundaries; it uses many threshold values and heuristic rules and thus may only work well for specific datasets. The uniform-colored method detects the text blocks with missing characters and produces many false positives due to the problem of color bleeding. The proposed method detects all the blocks correctly and even picks up the low contrast logo of the television channel.

Image (f) has both graphic text (at the bottom left and right corners) and scene text (at the top). The edge-based method detects the graphic text but misses the scene text. The gradient-based method also misses some graphic text (the first graphic text line) and scene text (the two scene text blocks at the top left and right corners). The uniform-colored method produces many false positives. The proposed method detects all the text blocks correctly, except one at the top right corner. One of the billboards on the road is also detected.

Figure 5. The detected text blocks of the three existing methods and the proposed method for input images (a) and (f): (a) input, (b) edge-based, (c) gradient-based, (d) uniform-colored, (e) proposed; (f) input (from [16]), (g) edge-based, (h) gradient-based, (i) uniform-colored, (j) proposed.

Figure 6 shows an image where the proposed method fails to detect some text blocks. The red text on the blue background is not detected because there is very low contrast between these two colors in the grayscale domain. The edge-based method and the gradient-based method have the same problem because they also use the grayscale image. By using the color information, the uniform-colored method is able to detect one of the two red text lines ("Life Alert").

Figure 6. The proposed method fails to detect some text blocks because the contrast is too low. (a) Input. (b) Proposed. (c) Edge-based. (d) Gradient-based. (e) Uniform-colored.

Figure 7 shows the results of the proposed method for two different window sizes. A small window size gives a low false positive rate but might miss some low contrast characters (on the third line, image (b)). On the other hand, a large window size helps to recover missing characters but also includes more false positives (image (c)). In our experiments, N is set to 5.

Figure 7. Results of different window sizes. (a) Input. (b) N = 5. (c) N = 21.

3.2. Results on the dataset

We define the following categories for each block detected by a text detection method.
• Truly Detected Block (TDB): a detected block that contains a text line, partially or fully.
• False Detected Block (FDB): a detected block that does not contain text.
• Text Block with Missing Data (MDB): a detected block that misses some characters of a text line (MDB is a subset of TDB).

For each image in the dataset, we manually count the Actual Text Blocks (ATB), i.e. the ground truth data. The performance measures are defined as follows.
• Detection Rate (DR) = TDB / ATB
• False Positive Rate (FPR) = FDB / (TDB + FDB)
• Misdetection Rate (MDR) = MDB / TDB

Tables 1 and 2 show the performance of the three existing methods and the proposed method on the dataset.

Table 1. Results on the dataset.

Method                 ATB   TDB   FDB   MDB
Edge-based [7]         491   393    86    79
Gradient-based [8]     491   349    48    35
Uniform-colored [15]   491   252    95    94
Proposed               491   458    39    55

Table 2. Performance on the dataset (%).

Method                 DR     FPR    MDR
Edge-based [7]         80.0   18.0   20.1
Gradient-based [8]     71.1   12.1   10.0
Uniform-colored [15]   51.3   27.4   37.3
Proposed               93.3    7.9   12.0

The proposed method has the highest DR and the lowest FPR. It outperforms the edge-based method and the uniform-colored method in all the performance measures. Compared to the gradient-based method, the proposed method has a better DR and FPR but a worse MDR. However, the slightly higher MDR might be compensated by the significant difference in DR between the two methods. If we consider the number of fully detected text blocks, i.e. text blocks which do not have any missing character, the proposed method detects 458 − 55 = 403 blocks while the gradient-based method only detects 349 − 35 = 314 blocks. Therefore, the proposed method has achieved better detection results than the three existing methods on the dataset.

4. Conclusion and future work

We have proposed an efficient method for text detection based on the Laplacian operator. The gradient information helps to identify the candidate text regions, and the edge information serves to determine the accurate boundary of each text block. Experimental results show that the proposed method outperforms the three existing methods in terms of detection and false positive rates. In the future, we plan to extend this method to text of arbitrary orientation. Currently, the text detection step can produce white patches even for non-horizontal text (Figure 8), but the refinement step is only able to detect the boundary for horizontal text because of the use of horizontal and vertical projection profiles.

Figure 8. The text detection step is able to show white patches for non-horizontal text.

5. Acknowledgment

This research is supported in part by IDM R&D grant R252-000-325-279.

6. References

[1] J. Zang and R. Kasturi, "Extraction of Text Objects in Video Documents: Recent Progress", The Eighth IAPR Workshop on Document Analysis Systems (DAS2008), Nara, Japan, September 2008, pp. 5-17.

[2] J. Zhang, D. Goldgof and R. Kasturi, "A New Edge-Based Text Verification Approach for Video", ICPR, December 2008, pp. 1-4.

[3] K. Jung, K. I. Kim and A. K. Jain, "Text information extraction in images and video: a survey", Pattern Recognition, 37, 2004, pp. 977-997.

[4] A. K. Jain and B. Yu, "Automatic Text Location in Images and Video Frames", Pattern Recognition, Vol. 31(12), 1998, pp. 2055-2076.

[5] M. Anthimopoulos, B. Gatos and I. Pratikakis, "A Hybrid System for Text Detection in Video Frames", The Eighth IAPR Workshop on Document Analysis Systems (DAS2008), Nara, Japan, September 2008, pp. 286-293.

[6] M. R. Lyu, J. Song and M. Cai, "A Comprehensive Method for Multilingual Video Text Detection, Localization, and Extraction", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 15, No. 2, February 2005, pp. 243-255.

[7] C. Liu, C. Wang and R. Dai, "Text Detection in Images Based on Unsupervised Classification of Edge-based Features", ICDAR 2005, pp. 610-614.

[8] E. K. Wong and M. Chen, "A new robust algorithm for video text extraction", Pattern Recognition, 36, 2003, pp. 1397-1406.

[9] P. Shivakumara, W. Huang and C. L. Tan, "An Efficient Edge based Technique for Text Detection in Video Frames", The Eighth IAPR Workshop on Document Analysis Systems (DAS2008), Nara, Japan, September 2008, pp. 307-314.

[10] Y. Zhong, H. Zhang and A. K. Jain, "Automatic Caption Localization in Compressed Video", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 4, 2000, pp. 385-392.

[11] K. I. Kim, K. Jung and J. H. Kim, "Texture-Based Approach for Text Detection in Images using Support Vector Machines and Continuously Adaptive Mean Shift Algorithm", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 25, No. 12, December 2003, pp. 1631-1639.

[12] Q. Ye, Q. Huang, W. Gao and D. Zhao, "Fast and robust text detection in images and video frames", Image and Vision Computing, 23, 2005, pp. 565-576.

[13] H. Li, D. Doermann and O. Kia, "Automatic Text Detection and Tracking in Digital Video", IEEE Transactions on Image Processing, Vol. 9, No. 1, January 2000, pp. 147-156.

[14] W. Mao, F. Chung, K. K. M. Lam and W. Siu, "Hybrid Chinese/English Text Detection in Images and Video Frames", ICPR, Volume 3, 2002, pp. 1015-1018.

[15] V. Y. Mariano and R. Kasturi, "Locating Uniform-Colored Text in Video Frames", 15th ICPR, Volume 4, 2000, pp. 539-542.

[16] X. S. Hua, W. Liu and H. J. Zhang, "Automatic Performance Evaluation for Video Text Detection", ICDAR, 2001, pp. 545-550.