1. Introduction
With the development of information technology, the processing of text has become increasingly important. As a key part of image information processing, scene text detection technology is attracting more and more attention from researchers. The scene text detection task mainly detects the position information of text regions from natural scenes. Accurate detection results facilitate the application of technologies such as text recognition, simultaneous translation, scene analysis, image retrieval, and autonomous driving.
Traditional text detection relies on human-designed features [1,2,3]: regions in a scene image that match the designed features are marked as text regions, thus distinguishing them from non-text regions. The detection results depend heavily on the designed features, including color, gradient, texture, and others. However, natural scene images often contain complex feature information, and the robustness of scene text detection based on only one or a few hand-designed features is often unsatisfactory. Deep learning, with its powerful feature learning capability, is widely used in machine learning research; the features it learns tend to be more discriminative and more robust, which greatly improves the performance of text detection [4]. In general, text detection can be divided into two main categories: regression-based and segmentation-based methods. Regression-based text detection is mostly inspired by traditional object detection: it treats text as a special kind of object and detects it by estimating bounding boxes and continuously regressing them toward the ground truth. However, it has certain limitations in fitting the edges of arbitrarily shaped text. Segmentation-based text detection, on the other hand, classifies each pixel of an image to determine whether it belongs to text, and a post-processing algorithm then produces the text boxes; this approach fits arbitrarily shaped text better.
Although segmentation-based methods can better fit text shapes and solve the problem of arbitrarily shaped text detection to a certain extent, scene text detection for arbitrary shapes still faces many challenges: (1) Complex backgrounds: natural scene images contain rich background information, including a large amount of interference that resembles text, and text varies greatly in shape, size, and aspect ratio; how the model can better learn text-related feature information is important for accurate scene text detection. (2) Coarse text boundary annotations: general natural scene text datasets only provide coarse-grained text boundary annotations, so the annotated text regions contain many background pixels; these misclassified pixels, called ambiguous samples, cause the model to learn a large amount of misinformation and reduce its detection performance to some extent. (3) Text adhesion in scene images: separating adjacent text instances becomes another task.
To address the above problems, previous research has focused on extracting features from scene images with regular convolutional kernels. Although certain results have been achieved, regular convolutional kernels tend to introduce a large amount of noise when detecting arbitrarily shaped text in regions containing many background pixels or strong interference, which degrades the detection performance of the model. Meanwhile, fitting text shapes and detecting small text still suffer from a high probability of false and missed detections, even though some segmentation-based techniques try to reduce the dependence on pixel prediction through orientation fields [5]; the performance of these methods on curved text remains unsatisfactory because the orientation is hard to predict. For adjacent text, PSENet [6] solves the text adhesion problem with multi-scale text kernels and progressive expansion, but this method leads to a long post-processing time. For coarse-grained text boundary annotations, previous segmentation-based approaches often convert the segmentation results into binary maps; since model learning relies heavily on the pixel classification results, these misclassified pixels often have a serious impact on the performance of the model.
We consider that position information, like semantic information, contributes positively to model learning. Therefore, in this paper we propose a segmentation-based method that combines semantic and position information. In this approach, we propose the position-encoding module (PosEM), which adds position-encoding information to the image so that the model can better learn the implicit relationships among features, such as the distribution relationships between text pixels, textures, and colors. Meanwhile, we propose the semantic enhancement module (SEM), which uses irregular convolution kernels that better fit the characteristics of text, enabling the model to extract the semantic information in the image more effectively while reducing the noise introduced by regular convolution kernels. These two modules allow the model to learn text-related feature information more effectively. We then use the position-encoding information generated by the PosEM to cluster text pixels and solve the problem of text adhesion, and the segmentation results are converted into a probability map that represents the text distribution more reasonably, reducing the interference caused by coarse-grained annotations and the distribution characteristics of text pixels.
The main contributions of this paper are as follows:
(1) We propose an efficient position-encoding module that adds position-encoding information to the image so that the model can perceive position information during learning and better learn the implicit relationships of feature information in the image, enhancing the robustness of the model.
(2) We propose the semantic information enhancement module, which strengthens the extraction of semantic information while reducing the interference caused by unnecessary noise, and fuses the features with stronger semantic information with the features containing position information to obtain improved feature information.
(3) We use the position encoding information to cluster text pixels, solve the text adhesion problem, and map text instances into more reasonable probability maps, reducing the interference of ambiguous samples due to coarse-grained annotations and text distribution characteristics.
(4) To demonstrate the effectiveness of our model on arbitrarily shaped text detection, we conducted experiments on several publicly available datasets. The experimental results show that our model is highly competitive with previous methods and ranks among the leading approaches.
2. Related Work
As mentioned above, traditional scene text detection relies on manually designed features and therefore does not perform well on scenes with complex feature information. With the rise of deep learning, its powerful learning ability can capture text features well, which largely improves the effectiveness of text detection. However, scene text detection still faces large variations in the size, aspect ratio, orientation, and shape of text. Current research is mainly based on convolutional neural networks (CNNs): first, a CNN extracts features and the resulting multi-scale feature maps are fused to compensate for the limitations of a single feature map; then, a confidence value is computed for each pixel to determine whether it belongs to text; finally, text instances are obtained by post-processing algorithms. Current scene text detection methods are divided into two main categories: regression-based and segmentation-based text detection.
The regression-based method for scene text detection builds on object detection algorithms such as Faster R-CNN [7], SSD [8], and Mask R-CNN [9], treating text as a special kind of object. Text instances are predicted by continuously regressing bounding boxes. CTPN [10] improves on Faster R-CNN by generating multiple vertical anchor boxes of fixed width and then combining the anchor boxes belonging to the same text instance into the final text box. The method detects horizontally distributed text in natural scenes effectively but performs poorly on non-horizontal text. RRPN [11] proposes a text detection method with a rotation angle by adding angle information to the region proposal network (RPN) and then mapping the RROI proposal box onto the feature map; the result is sent to the classifier for processing. The angle information of RRPN makes it well suited for detecting inclined text. EAST [12] proposes an efficient end-to-end detection method that achieves text detection in arbitrary directions by regressing offsets for the boundaries and vertices of the text. MOST [13] uses deformable convolution to adjust the receptive field of the feature layer and proposes an instance-wise IoU loss function, addressing the insufficient receptive field for long text and the poor detection of small-scale text. SegLink [14] cuts each text instance into more easily detectable directed text segments and then determines whether segments belong to the same text instance based on the links between them, which to a certain extent solves the detection of long and multi-oriented text. For long and curved text, LOMO [15] proposes an iterative module and a shape expression module to obtain more accurate text region locations. ContourNet [16] designs an adaptive RPN that effectively handles large scale variation and achieves finer text region localization. Nevertheless, regression-based approaches still have some limitations in arbitrarily shaped scene text detection.
Segmentation-based scene text detection methods make predictions at the pixel level: each pixel is classified to determine whether it belongs to a text object and how it connects to its surrounding pixels, yielding a mask for the text. The pixels are then clustered by post-processing to reconstruct the text instances. PixelLink [17] predicts links to the eight neighboring pixels of each pixel and determines whether they belong to the same text instance based on the prediction result. PSENet sets kernels of different scales for text regions and gradually expands the detected text regions using a progressive expansion method, which effectively separates adjacent text instances during reconstruction. To address the high time overhead of PSENet's post-processing, PAN [18] proposed a new pixel clustering method with faster prediction speed while maintaining high precision. SPCNet [19] proposes a text context module to complement the contextual information and uses a re-scoring mechanism to effectively suppress false positives. DBNet [20] proposed a module named differentiable binarization, which solves the problem that standard binarization cannot be optimized and learned together with the network due to its non-differentiability; this design lets the network set binarization thresholds adaptively, which greatly improves detection performance.
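For context, the approximate (differentiable) binarization introduced by DBNet [20] replaces the hard threshold with a steep sigmoid; here $P$ is the probability map, $T$ the learned threshold map, and $k$ an amplifying factor (the form follows the original DBNet paper, not the present work):

$$\hat{B}_{i,j} = \frac{1}{1 + e^{-k\left(P_{i,j} - T_{i,j}\right)}}$$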
3. Proposed Method
The overall structure of our proposed model is shown in
Figure 1.
We use the positional encoding module (PosEM) to obtain the absolute positional encoding of the image, feed this encoding together with the image into the backbone network for feature extraction, and compute the relative position encoding. To make full use of the multi-layer feature information, we use a feature pyramid network (FPN) [21] to fuse the multi-scale feature layers. The semantic enhancement module (SEM) is then used to obtain feature maps with strong semantic information and fuse them, enhancing the semantic information in the feature maps, and this process is strengthened through iteration. Finally, the positional information generated by PosEM is used to cluster pixels, reconstruct text instances, and produce probability maps that represent the text distribution more reasonably.
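To summarize the data flow above, the following is a schematic sketch of the inference pipeline; the module names and interfaces are illustrative placeholders rather than the actual implementation.

```python
import torch.nn as nn

class ArbitraryShapeTextDetectorSketch(nn.Module):
    """Schematic sketch of the pipeline described above; the module
    interfaces and names here are illustrative placeholders."""

    def __init__(self, pos_encoder, backbone, fpn, sem, head):
        super().__init__()
        self.pos_encoder = pos_encoder   # PosEM: absolute position encoding
        self.backbone = backbone         # e.g., ResNet-50 feature extractor
        self.fpn = fpn                   # multi-scale feature fusion
        self.sem = sem                   # semantic enhancement module
        self.head = head                 # per-pixel probability prediction

    def forward(self, image):
        x = self.pos_encoder(image)      # image + absolute position encoding
        features = self.fpn(self.backbone(x))
        features = self.sem(features)    # enhance semantics, fuse iteratively
        prob_map = self.head(features)   # probability map of text regions
        return prob_map                  # post-processing clusters pixels
```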
3.1. Positional Encoding Module (PosEM)
The structure of PosEM is shown in
Figure 2.
In this part of our work, the following analysis was performed.
Compared with a fully connected network (FCN) [22], the convolutional kernel of a convolutional neural network (CNN) can be viewed as a local filter that improves the efficiency of image processing through local connections within a limited spatial extent. Precisely for this reason, although the convolutional kernel can perceive information within the current region, such local connections are generally considered unable to perceive the position where the kernel is currently located.
An image contains rich and complex information, and the distribution of this information is extremely uneven; the useful information at each position differs from task to task. The scene text detection task focuses on learning the information and features of text. Through our analysis, we found that although the distribution of text regions in images is irregular, the distribution of features such as text pixels, textures, and colors within text regions is distinctive; for example, pixels at the center of a text region have a high probability of being text pixels, while pixels in the edge regions have a low probability. This is because, due to inaccurate text annotations and the distribution characteristics of text pixels, the edge regions contain more background information; the probability that pixels in these regions are text pixels is low, yet they are still labeled as text regions. These ambiguous samples interfere with model learning.
Based on the above analysis, we add the PosEM to the CNN to help the model learn the relationships in text features more effectively and to minimize the impact of other interfering information in the image. Since the image size is not fixed, the positional encoding must be extensible, so we adopt a scalable and efficient way to generate an absolute positional encoding for the image. In this encoding operation, we resolve the position coordinates along two dimensions: for any point $(x, y)$ in the image, its position information is encoded as $E(x, y) = f(x, y)$, where $E(x, y)$ denotes the generated position encoding and $f(\cdot)$ is the position encoding function. The position information is then combined with the original image as the input of the network, so that the CNN perceives position information during convolution and thus learns the implicit relationships in the feature distribution of the image. Such an encoding provides explicit position information at little additional computational cost.
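To make the idea concrete, below is a minimal sketch of one way to realize such an absolute position encoding, assuming normalized coordinate channels concatenated with the input image; the exact encoding function $f$ and the way it is combined with the image are assumptions here, not the paper's specification.

```python
import torch

def add_absolute_position_encoding(image: torch.Tensor) -> torch.Tensor:
    """Append a two-channel absolute position encoding to an image batch.

    Minimal sketch: the encoding function f(x, y) is assumed to be a pair of
    normalized x/y coordinate channels, and "combined with the image" is
    assumed to mean channel-wise concatenation.
    """
    n, _, h, w = image.shape
    # Normalized coordinates in [0, 1] so the encoding scales with image size.
    ys = torch.linspace(0.0, 1.0, steps=h, device=image.device)
    xs = torch.linspace(0.0, 1.0, steps=w, device=image.device)
    y_grid, x_grid = torch.meshgrid(ys, xs, indexing="ij")
    pos = torch.stack([x_grid, y_grid], dim=0)        # (2, H, W)
    pos = pos.unsqueeze(0).expand(n, -1, -1, -1)      # (N, 2, H, W)
    return torch.cat([image, pos], dim=1)             # (N, C+2, H, W)
```

Because the coordinates are normalized, the same encoding procedure applies to images of any size, which matches the extensibility requirement stated above.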
Moreover, to prevent the noise from inaccurate text annotations from affecting model learning, we treat the text region as a probability map instead of a binary map. To generate the corresponding probability maps, we construct a relative position encoding for arbitrarily shaped text. The construction method is as follows: we first construct the axis line by dividing the boundary points of the long edges of the text region into two point sets $S_1$ and $S_2$, generating the positioning connection lines between them, determining the axis-line positioning points according to the edge points and the positioning connection lines, and connecting the positioning points to form the axis line. To make the generated position encoding more reasonable, we scale the axis line with a shrinkage offset $d$ computed from the area $A$ of the polygon region, its perimeter $L$, and a shrinkage ratio $r$ that is set empirically, and shrink the axis line inward at both ends by $d$. Then we divide the text region into multiple quadrilaterals $Q_i$ and, using the absolute position encoding, calculate the distances $d_1$ and $d_2$ of the pixels in each quadrilateral $Q_i$ to the axis line and to the boundary, respectively, where the quadrilateral defines which pixels belong to it. Since these are relative distances, we call this distance information relative position information (as shown in Figure 3), where $p_1$ and $p_2$ are the two endpoints of the axis segment of $Q_i$, $v$ is a pixel in $Q_i$, and $p'$ is the foot of the perpendicular from $v$ onto $p_1 p_2$.
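As an illustration of the distance computation above, the sketch below computes the distance from a pixel $v$ to the axis segment $p_1 p_2$ via the perpendicular foot point $p'$; the function name and the clamping of the foot point to the segment are illustrative choices, not specified by the paper.

```python
import numpy as np

def point_to_segment_distance(v, p1, p2):
    """Distance from pixel v to the axis segment p1-p2 via the perpendicular
    foot point p'. Inputs are (x, y) coordinates; names and the clamping of
    the foot point to the segment are illustrative choices."""
    v, p1, p2 = map(np.asarray, (v, p1, p2))
    seg = p2 - p1
    t = np.dot(v - p1, seg) / (np.dot(seg, seg) + 1e-8)
    t = np.clip(t, 0.0, 1.0)          # keep the foot point on the segment
    foot = p1 + t * seg               # p', the foot point of v on p1p2
    return float(np.linalg.norm(v - foot)), foot
```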
3.2. Semantic Enhancement Module (SEM)
In segmentation-based scene text detection, segmentation performance can be significantly improved when the semantic information is as comprehensive and accurate as possible. Therefore, in this module we focus on how to obtain more accurate semantic information. The architecture of the SEM is shown in Figure 4. We use irregular convolutional kernels [23], feature information fusion, and an iterative method [24] to obtain more semantic information, and this process is strengthened to obtain more complete feature information.
First, we use non-standard convolution kernels. Unlike the instance objects in general object detection, text instances usually have a large aspect ratio, and this characteristic limits standard convolution kernels in scene text detection: the square convolutional kernel introduces a large amount of noise as its receptive field expands, which harms the learning of the model, while limiting this noise restricts the size of the receptive field and therefore the detection of long text. The non-standard convolution kernel matches the shape and size of text instances better, so we use it for feature extraction of text instances. Such a kernel has several advantages: the extracted features carry stronger semantic information, the noise introduced by enlarging the receptive field of a standard square kernel is avoided, and the number of learnable parameters of the model is reduced.
Next, to further fuse the semantic and position information and obtain more complete features, we superimpose the feature map with strong semantic information onto the feature map carrying implicit position information. The feature extraction process is then strengthened using an iterative approach.
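As a rough illustration of how non-square kernels and feature fusion could be combined, the sketch below uses 1×k and k×1 strip-shaped kernels and element-wise addition for fusion; the actual kernel sizes and fusion operation of the SEM are not specified in the text, so they are assumptions.

```python
import torch
import torch.nn as nn

class SemanticEnhancementSketch(nn.Module):
    """Illustrative sketch of the SEM idea: non-square (strip-shaped) kernels
    that match the elongated shape of text, followed by fusion with the
    position-aware features. Kernel sizes (1xk / kx1) and element-wise
    addition for fusion are assumptions, not the paper's exact settings."""

    def __init__(self, channels: int, k: int = 9):
        super().__init__()
        self.horizontal = nn.Conv2d(channels, channels, kernel_size=(1, k),
                                    padding=(0, k // 2))
        self.vertical = nn.Conv2d(channels, channels, kernel_size=(k, 1),
                                  padding=(k // 2, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, pos_features: torch.Tensor) -> torch.Tensor:
        # Strip-shaped receptive fields cover long text while introducing
        # less background noise than a large square kernel.
        semantic = self.relu(self.horizontal(pos_features) +
                             self.vertical(pos_features))
        # Fuse strong semantic features with the position-aware features.
        return semantic + pos_features
```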
3.3. Post-filtering Algorithm
To solve text adhesion and to reduce the interference caused by inaccurate text boundary annotations and the distribution characteristics of text pixels, we reconstruct text instances by clustering text pixels based on the relative position encoding, and we use a probability value conversion function, with a parameter that is set empirically, to convert the binary map into a probability map that represents text probabilities more reasonably.
Considering that a threshold filtering algorithm based on a single condition has significant limitations and is prone to misclassifying text instances with strong background interference, we design the post-filtering algorithm using both the average confidence and the area of the predicted text instance regions. This filtering algorithm integrates two plausibility conditions for text instances and can effectively avoid the misclassifications that occur under a single condition.
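A minimal sketch of this filtering step, assuming each reconstructed instance carries its mean pixel confidence and its pixel area; the threshold values below are illustrative, not the paper's.

```python
def filter_text_instances(instances, conf_thresh=0.8, area_thresh=100):
    """Keep predicted text instances that satisfy both plausibility conditions.

    `instances` is assumed to be a list of dicts with an 'avg_confidence'
    (mean probability over the instance's pixels) and an 'area' (pixel
    count); both thresholds are illustrative values, not the paper's.
    """
    kept = []
    for inst in instances:
        if inst["avg_confidence"] >= conf_thresh and inst["area"] >= area_thresh:
            kept.append(inst)
    return kept
```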
3.4. Loss Function
Our loss function is expressed as the sum of the losses over the multiple quadrilaterals of a text instance,
$$L = \sum_{i} w_i L_i,$$
where $w_i$ is the weight of the loss in the corresponding quadrilateral region and the sum of all $w_i$ is equal to 1. The loss in each quadrilateral region is the mean squared error (MSE) over all pixels in the quadrilateral,
$$L_i = \frac{1}{\lvert Q_i \rvert} \sum_{v \in Q_i} \left( G_i(v) - P_i(v) \right)^2,$$
where $G_i$ is the corresponding quadrilateral in the ground truth, $P_i$ is the quadrilateral in the prediction map, and $v$ represents the pixels in $Q_i$. To overcome the problem of unbalanced positive and negative samples, we adopt an online hard example mining (OHEM) [25] strategy for the loss.
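The sketch below implements this weighted per-quadrilateral MSE for a single text instance in PyTorch; the OHEM selection step is omitted and the tensor layout is an assumption.

```python
import torch

def instance_loss(pred, gt, quad_masks, weights):
    """Weighted per-quadrilateral MSE loss for one text instance (a sketch).

    pred, gt:    (H, W) prediction and ground-truth probability maps.
    quad_masks:  list of boolean (H, W) masks, one per quadrilateral Q_i.
    weights:     per-quadrilateral weights w_i that sum to 1.
    OHEM-style hard-example selection is omitted here for brevity.
    """
    loss = pred.new_zeros(())
    for w_i, mask in zip(weights, quad_masks):
        if mask.any():
            diff = pred[mask] - gt[mask]
            loss = loss + w_i * (diff ** 2).mean()   # MSE over pixels in Q_i
    return loss
```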
4. Experiments
In this section, we describe the public datasets used for the experiments and our experimental configuration. We conducted ablation experiments on these public datasets and compared the effectiveness of our method with that of other detection algorithms through comparison experiments. Three main metrics, precision (P), recall (R), and F-measure (F), are used to evaluate the detection performance of the model.
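For reference, the F-measure reported throughout is the standard harmonic mean of precision and recall:

$$F = \frac{2 \times P \times R}{P + R}$$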
4.1. Datasets
To fully validate the performance of our model, we validate our model on the following challenging public datasets.
Total-Text [26] is a publicly available dataset consisting of 1555 scene images, with a training set of 1255 images and a test set of 300 images. This dataset uses polygonal annotation boxes because the scene images contain much multi-oriented and curved text. The images also contain a large amount of background noise that resembles the text regions.
The CTW1500 [27] dataset consists of 1500 scene images, of which 1000 form the training set and the remaining 500 the test set. Its images likewise contain a large amount of arbitrarily shaped text, and the text regions are marked with polygonal bounding boxes.
The MSRA-TD500 [28] dataset contains 500 scene images with both Chinese and English text. The training set consists of 300 images randomly selected from the original dataset, and the remaining 200 images constitute the test set. The diversity of text sizes and text spacing, together with backgrounds similar to the text, makes this dataset challenging.
4.2. Implementation Details
The experiments in this paper were conducted in a Linux environment, using Python 3.7 as the programming language and PyTorch 1.2 as the implementation framework, on an Intel(R) Xeon(R) E5-2640 v4 CPU @ 2.40 GHz and a Tesla V100 16 GB GPU. All the following experiments were conducted on a single GPU.
We used ResNet-50 [29], pre-trained on the ImageNet dataset, as our backbone, with the rest of the network as described above. The network uses stochastic gradient descent (SGD) [30] as the optimizer, with the initial learning rate set to 0.01 and the current learning rate dynamically adjusted to the initial learning rate multiplied by a decay factor that depends on the current number of iterations. Our model is briefly pre-trained on the SynthText [31] dataset, and all datasets are used as officially provided. On the Total-Text dataset, we resized the images to 800 × 800 and set the batch size to 4 for 800 epochs. On the CTW1500 dataset, we resized the images to 640 × 640, set the batch size to 6, and trained for 800 epochs. On the MSRA-TD500 dataset, we resized the images to 640 × 640 and set the batch size to 6 for 900 epochs. To improve the robustness of the model, we performed data augmentation on the training data with random rotation, random cropping, and random flipping.
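The decay factor is not fully specified above; as a hedged illustration, the sketch below assumes the common "poly" schedule used by PSENet-style baselines.

```python
def adjusted_learning_rate(initial_lr, cur_iter, max_iter, power=0.9):
    """Hypothetical 'poly' decay: the text only states that the learning rate
    is the initial rate multiplied by a factor depending on the current
    iteration; the exact formula and the power value are assumptions borrowed
    from common PSENet-style training recipes."""
    return initial_lr * (1.0 - cur_iter / max_iter) ** power

# Example: at iteration 400 of 800 with initial_lr = 0.01,
# adjusted_learning_rate(0.01, 400, 800) is approximately 0.0054.
```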
4.3. Ablation Experiments
We conducted ablation experiments on the publicly available MSRA-TD500 and Total-Text datasets to verify the effectiveness of our proposed PosEM and SEM. First, we evaluated the baseline without PosEM and SEM; then we evaluated the effect of adding only PosEM and of adding only SEM to the baseline, respectively. Finally, we validated the complete model with both PosEM and SEM. Note that the “baseline” mentioned in our experiments refers to the model reproduced from the PSENet open-source repository. The results are shown in
Table 1, where the image sizes are resized to 800 and 640, respectively.
From
Table 1, we can see that our proposed PosEM effectively improves the performance of the ResNet-50 baseline. PosEM improves precision, recall, and F-measure by 1.46%, 2.92%, and 1.56%, respectively, on the Total-Text dataset, and by 3.73%, 2.95%, and 1.56%, respectively, on the MSRA-TD500 dataset. This is because PosEM helps the model better learn the implicit relationships of feature information, which helps it identify text regions more accurately. At the same time, the position information is used to supervise the learning of the model, which reduces the interference of noise in the text regions on model learning.
As shown in
Table 1, the SEM module helps the model better extract the semantic information of text regions, which effectively enhances the model's ability to discriminate between text and non-text regions. Therefore, SEM also brings a certain gain in model performance. On the Total-Text dataset, the precision and F-measure improve by 1.73% and 1.15%, respectively. On the MSRA-TD500 dataset, the precision and F-measure improve by 4.53% and 1.81%, respectively.
Overall, the precision, recall, and F-measure of the baseline algorithm on the Total-Text dataset are 86.28%, 80.44%, and 83.25%, respectively, while our method improves the three metrics by 3.01%, 2.95%, and 2.99%, reaching 89.29%, 83.39%, and 86.24%. On the MSRA-TD500 dataset, the precision, recall, and F-measure of the baseline algorithm are 83.25%, 80.93%, and 82.07%, respectively, while our method improves the three metrics by 6.28%, 2.39%, and 4.24%, reaching 89.53%, 83.32%, and 86.31%. It can be seen that our performance is significantly better than that of the baseline algorithm.
4.4. Comparison Experiments
In this section, to fully validate the performance of our model, we compare it with previous methods on several publicly available datasets including Total-Text, MSRA-TD500, and CTW1500, as shown in
Table 2,
Table 3 and
Table 4. We keep the aspect ratio of images and resize them to the appropriate size for testing.
As seen in
Table 2, on the Total-Text dataset, our algorithm improves precision, recall, and F-measure by 5.6%, 5.48%, and 5.55%, respectively, compared with PSENet, a segmentation algorithm that also uses ResNet-50 as its backbone network. In terms of precision and F-measure, our algorithm outperforms the other algorithms by at least 0.32% and 0.42%. This is due to our algorithm's ability to fit text boundaries better and to discriminate text regions from non-text regions. Although we have an advantage in precision and F-measure, the recall of our algorithm is 1.49% lower than the best result, which is caused by text with large character spacing.
As seen in
Table 3, on the MSRA-TD500 dataset, the F-measure of our algorithm is 86.73%, which is only 0.47% lower than the DBNet++ algorithm with the best performance. Compared to other algorithms such as PAN, DBNet, and DRRG, our F-measure is at least 1.65% higher. This shows that our algorithm is still competitive in multilingual and multi-directional text detection.
As seen in
Table 4, the precision of our algorithm on the CTW1500 dataset is 86.91%, only 0.99% lower than that of DBNet++, which has the best performance. Our F-measure of 83.59% is broadly comparable with those of the TextMountain, ContourNet, and DBNet algorithms and is 1.81% lower than the best result. This is because CTW1500 contains many texts with large character spacing, and the limited receptive field prevents the algorithm from completely fitting the text edges. Nevertheless, our algorithm still outperforms PSENet, a segmentation algorithm that also uses ResNet-50 as its backbone network, in terms of overall performance.
As shown in
Table 2,
Table 3 and
Table 4, in terms of inference speed, the FPS of our model is greater than 25, which means that our algorithm runs in nearly real time.
4.5. Visualization Results
We visualize the detection results and can see that our method can fit the edges of text shapes well in the detection of arbitrarily shaped text. We further validate the superiority of our model by comparing the detection results with those of other methods, as shown in
Figure 5,
Figure 6 and
Figure 7.
Figure 5 demonstrates that our method performs better in regions that resemble text and in regions with strong interference.
Figure 6 demonstrates that our method works well on the problems of adjacent text separation and large-scale difference text detection.
Figure 7a,b show that our model also works well for abstract word and multi-angle shot text detection.
Figure 7c,d demonstrate that our model can correctly detect even text that was overlooked during annotation. It can be seen that our model remains robust in complex situations.