Abstract
Recently, segmentation-based methods have quickly become the mainstream in scene text detection, owing to their precise description of arbitrary-shape texts. However, the reduced inference speed hinders the practical application of segmentation-based methods. In this paper, we propose an efficient and accurate arbitrary-shaped text detector named ViT-Bilateral DBNet, which improves the efficiency of feature processing approach to achieve a good trade-off between accuracy and real-time performance. Specifically, we first combine Differentiable Binarization (DB) with real-time semantic segmentation BiSeNet V2 which is more suitable to process features for segmentation-based methods. Then three improvements are proposed to optimize the initial integrated network. ViT-Bilateral Network can strengthen the feature extracting capability of neural networks. Attention-driven Aggregation Layer (AAL) can adaptively fuse the details and the semantics achieved by ViT-Bilateral Network. Meanwhile, the auxiliary loss is added to make the training more sufficient. Compared with original DBNet, our method not only gains 1.17% (on IC15) and 1.34% (on CTW 1500) improvements, but also runs 1.38 times and 1.34 times faster. Notably, our detector surpasses the previous best record and maintains a high inference speed.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Long, S., He, X., Yao, C.: Scene text detection and recognition: the deep learning era. Int. J. Comput. Vis. (IJCV) 129, 161–184 (2021). https://doi.org/10.1007/s11263-020-01369-0
Bonechi, S., Andreini, P., Bianchini, M., Scarselli, F.: COCO_TS dataset: pixel–level annotations based on weak supervision for scene text segmentation. In: Tetko, I.V., Kůrková, V., Karpov, P., Theis, F. (eds.) ICANN 2019. LNCS, vol. 11729, pp. 238–250. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-30508-6_20
Liao, M., Wan, Z., Yao, C., et al.: Real-time scene text detection with differentiable binarization. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 11474–11481 (2020)
He, K., Zhang, X., Ren, S., et al.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016)
Lin, T.Y., Dollár, P., Girshick, R., et al.: Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2117–2125 (2017)
Chen, Q., Wang, Y., Yang, T., et al.: You only look one-level feature. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
Yu, C., Gao, C., Wang, J., et al.: BiSeNet V2: bilateral network with guided aggregation for real-time semantic segmentation. Int. J. Comput. Vis. (IJCV) 129, 3051–3068 (2021). https://doi.org/10.1007/s11263-021-01515-2
Tian, Z., Huang, W., He, T., He, P., Qiao, Y.: Detecting text in natural image with connectionist text proposal network. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 56–72. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_4
Liao, M., Shi, B., Bai, X., et al.: Textboxes: a fast text detector with a single deep neural network. In: Thirty-First AAAI Conference on Artificial Intelligence (2017)
Zhou, X., Yao, C., Wen, H., et al.: East: an efficient and accurate scene text detector. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), pp. 5551–5560 (2017)
Shi, B., Bai, X., Belongie, S.: Detecting oriented text in natural images by linking segments. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2550–2558 (2017)
Wang, W., Xie, E., Li, X., et al.: Shape robust text detection with progressive scale expansion network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9336–9345 (2019)
Wang, W., Xie, E., Song, X., et al.: Efficient and accurate arbitrary-shaped text detection with pixel aggregation network. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (CVPR), pp. 8440–8449 (2019)
Paszke, A., Chaurasia, A., Kim, S., et al.: ENet: a deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147 (2016)
Zhao, H., Qi, X., Shen, X., Shi, J., Jia, J.: ICNet for real-time semantic segmentation on high-resolution images. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11207, pp. 418–434. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01219-9_25
Li, H., Xiong, P., Fan, H., et al.: DFANet: deep feature aggregation for real-time semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9522–9531 (2019)
Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., Sang, N.: BiSeNet: bilateral segmentation network for real-time semantic segmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11217, pp. 334–349. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01261-8_20
Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu, Z., Lin, Y., Cao, Y, et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ECCV), pp. 10012–10022. IEEE (2021)
Zhou, J., Wang, P., Wang, F., et al.: ELSA: enhanced local self-attention for vision transformer. arXiv preprint arXiv:2112.12786 (2021)
Liu, Z., Mao, H., Wu, C.Y., et al.: A ConvNet for the 2020s. arXiv preprint arXiv:2201.03545 (2022)
Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7132–7141 (2018)
Woo, S., Park, J., Lee, J.-Y., Kweon, I.S.: CBAM: convolutional block attention module. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 3–19. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01234-2_1
Gupta, A., Vedaldi, A., Zisserman, A.: Synthetic data for text localisation in natural images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2315–2324 (2016)
Karatzas, D., Gomez-Bigorda, L., Nicolaou, A., et al.: ICDAR 2015 competition on robust reading. In: 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 1156–1160 IEEE (2015)
Yuliang, L., Lianwen, J., Shuaitao, Z., et al.: Detecting curve text in the wild: new dataset and new solution. arXiv preprint arXiv:1712.02170 (2017)
Acknowledgments
The work is supported by National Natural Science Foundation of China (No. 52105528).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, Y., Shi, Y., Lin, C., Hua, J., Huang, Z. (2022). Efficient and Accurate Text Detection Combining Differentiable Binarization with Semantic Segmentation. In: Pimenidis, E., Angelov, P., Jayne, C., Papaleonidas, A., Aydin, M. (eds) Artificial Neural Networks and Machine Learning – ICANN 2022. ICANN 2022. Lecture Notes in Computer Science, vol 13531. Springer, Cham. https://doi.org/10.1007/978-3-031-15934-3_52
Download citation
DOI: https://doi.org/10.1007/978-3-031-15934-3_52
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-15933-6
Online ISBN: 978-3-031-15934-3
eBook Packages: Computer ScienceComputer Science (R0)