Learning Pixel Affinity Pyramid for Arbitrary-Shaped Text Detection

Published: 03 February 2023

    Abstract

    Arbitrary-shaped text detection in natural images is challenging because backgrounds are complex and text properties are diverse. The difficulty lies in two aspects: accurately separating adjacent texts and representing text features sufficiently. To address these problems, we treat text detection as instance segmentation and propose a novel text detection framework that jointly learns semantic segmentation and a pixel affinity pyramid in a unified fully convolutional network. Specifically, the pixel affinity pyramid encodes multi-scale instance affiliation relationships between pixels; it is robust to varying text shapes and provides an accurate boundary description for separating closely located texts. In the inference phase, a simple but effective post-processing step reconstructs text instances from the semantic segmentation results under the guidance of the learned pixel affinity pyramid, achieving good accuracy and efficiency. Furthermore, to enhance the representation of text features in the network, we propose two modules: the Region Enhancement Module (REM) and the Attentional Fusion Module (AFM). The REM models the semantic correlations of regional features to enhance features from text areas, which effectively suppresses false-positive detections. The AFM adaptively fuses multi-scale textual information through an attention mechanism to obtain rich text semantic features, which benefits the detection of multi-sized text. Extensive ablation experiments demonstrate the effectiveness of the REM and AFM. Evaluation results on standard benchmarks, including Total-Text, ICDAR2015, SCUT-CTW1500, and MSRA-TD500, show that our method surpasses most existing text detectors and achieves state-of-the-art performance, indicating its superior capability in detecting arbitrary-shaped text.
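
    The abstract does not spell out how affinities are defined or how the post-processing works, so the following is only a minimal single-scale sketch of the general idea of affinity-guided grouping, not the authors' method. It assumes a hypothetical two-neighbour offset set (`OFFSETS`), a binary affinity label (1 when a pixel and its neighbour share a text instance), and a union-find merge with a hypothetical threshold `thr`. A pyramid version would presumably repeat this at several resolutions.

```python
import numpy as np

# Hypothetical neighbour offsets (dy, dx): right and down. The neighbourhood
# and pyramid scales actually used in the paper are not given in the abstract.
OFFSETS = [(0, 1), (1, 0)]


def affinity_from_instances(inst):
    """Build affinity labels from an instance map (0 = background).

    aff[k, y, x] is 1 when pixel (y, x) and its k-th offset neighbour
    belong to the same non-background instance, and 0 otherwise.
    """
    h, w = inst.shape
    aff = np.zeros((len(OFFSETS), h, w), dtype=np.float32)
    for k, (dy, dx) in enumerate(OFFSETS):
        a = inst[: h - dy, : w - dx]   # each pixel
        b = inst[dy:, dx:]             # its neighbour at the given offset
        aff[k, : h - dy, : w - dx] = (a == b) & (a > 0)
    return aff


def group_pixels(text_mask, aff, thr=0.5):
    """Split a semantic text mask into instances using pixel affinities.

    Two neighbouring text pixels are merged (union-find) only when their
    affinity exceeds thr, so touching instances can stay separate.
    """
    h, w = text_mask.shape
    parent = list(range(h * w))

    def find(i):
        # Find the set representative, with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for k, (dy, dx) in enumerate(OFFSETS):
        for y in range(h - dy):
            for x in range(w - dx):
                if (text_mask[y, x] and text_mask[y + dy, x + dx]
                        and aff[k, y, x] > thr):
                    ra = find(y * w + x)
                    rb = find((y + dy) * w + (x + dx))
                    if ra != rb:
                        parent[rb] = ra

    # Relabel set representatives as consecutive instance ids 1, 2, ...
    labels = np.zeros((h, w), dtype=np.int32)
    ids = {}
    for y in range(h):
        for x in range(w):
            if text_mask[y, x]:
                root = find(y * w + x)
                labels[y, x] = ids.setdefault(root, len(ids) + 1)
    return labels
```

    In this sketch, two text regions that touch form a single connected component in the semantic mask, yet remain separate instances because the affinities across their shared boundary are low. In practice `aff` would come from the network's prediction rather than from ground-truth instances.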

    Cited By

    • (2024) A Review on the Application of Segmentation-Based Text Detection Techniques for Natural Scenes. Artificial Intelligence and Robotics Research 13:02 (399-407). DOI: 10.12677/airr.2024.132041. Online publication date: 2024
    • (2024) Combining Swin Transformer and Attention-Weighted Fusion for Scene Text Detection. Neural Processing Letters 56:2. DOI: 10.1007/s11063-024-11501-7. Online publication date: 17-Feb-2024
    • (2023) Text Growing on Leaf. IEEE Transactions on Multimedia 25 (9029-9043). DOI: 10.1109/TMM.2023.3244322. Online publication date: 1-Jan-2023

    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 1s
    February 2023
    504 pages
    ISSN:1551-6857
    EISSN:1551-6865
    DOI:10.1145/3572859
    Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 03 February 2023
    Online AM: 04 July 2022
    Accepted: 08 March 2022
    Revised: 21 February 2022
    Received: 01 September 2020
    Published in TOMM Volume 19, Issue 1s

    Author Tags

    1. Scene text detection
    2. pixel affinity
    3. deep learning
    4. instance segmentation

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • National Natural Science Foundation of China
    • Fundamental Research Funds for the Central Universities
    • Youth Innovation Promotion Association Chinese Academy of Sciences
