Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article

AB-LSTM: Attention-based Bidirectional LSTM Model for Scene Text Detection

Published: 16 December 2019 Publication History
  • Get Citation Alerts
  • Abstract

    Detection of scene text in arbitrary shapes is a challenging task in the field of computer vision. Most existing scene text detection methods exploit the rectangle/quadrangular bounding box to denote the detected text, which fails to accurately fit text with arbitrary shapes, such as curved text. In addition, recent progress on scene text detection has benefited from Fully Convolutional Network. Text cues contained in multi-level convolutional features are complementary for detecting scene text objects. How to explore these multi-level features is still an open problem. To tackle the above issues, we propose an Attention-based Bidirectional Long Short-Term Memory (AB-LSTM) model for scene text detection. First, word stroke regions (WSRs) and text center blocks (TCBs) are extracted by two AB-LSTM models, respectively. Then, the union of WSRs and TCBs are used to represent text objects. To verify the effectiveness of the proposed method, we perform experiments on four public benchmarks: CTW1500, Total-text, ICDAR2013, and MSRA-TD500, and compare it with existing state-of-the-art methods. Experiment results demonstrate that the proposed method can achieve competitive results, and well handle scene text objects with arbitrary shapes (i.e., curved, oriented, and horizontal forms).

    References

    [1]
    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. Retrieved from Arxiv Preprint Arxiv:1409.0473 (2014).
    [2]
    Michal Busta, Lukas Neumann, and Jiri Matas. 2015. Fastext: Efficient unconstrained scene text detector. In Proceedings of the International Conference on Computer Vision (ICCV’15). 1206--1214.
    [3]
    Chee Kheng Ch’ng and Chee Seng Chan. 2017. Total-text: A comprehensive dataset for scene text detection and recognition. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR’17). 935--942.
    [4]
    Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, and Rita Cucchiara. 2018. Paying more attention to saliency: Image captioning with saliency and context attention. ACM Trans. Multimedia Comput., Commun., Applic. 14, 2 (2018), 48.
    [5]
    Dan Deng, Haifeng Liu, Xuelong Li, and Deng Cai. 2018. PixelLink: Detecting scene text via instance segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI’18). 6773--6780.
    [6]
    Boris Epshtein, Eyal Ofek, and Yonatan Wexler. 2010. Detecting text in natural scenes with stroke width transform. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’10). 2963--2970.
    [7]
    Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. 2010. The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vis. 88, 2 (2010), 303--338.
    [8]
    Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. 2016. Synthetic data for text localisation in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16). 2315--2324.
    [9]
    Dafang He, Xiao Yang, Chen Liang, Zihan Zhou, G. Alexander, I. I. Ororbia, Daniel Kifer, and C. Lee Giles. 2017. Multi-scale FCN with cascaded instance aware segmentation for arbitrary oriented word spotting in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17). 474--483.
    [10]
    Pan He, Weilin Huang, Tong He, Qile Zhu, Yu Qiao, and Xiaolin Li. 2017. Single shot text detector with regional attention. In Proceedings of the International Conference on Computer Vision (ICCV’17). 3047--3055.
    [11]
    Tong He, Weilin Huang, Yu Qiao, and Jian Yao. 2016. Accurate text localization in natural image with cascaded convolutional text network. Retrieved from: Arxiv Preprint Arxiv:1603.09423 (2016).
    [12]
    Tong He, Zhi Tian, Weilin Huang, Chunhua Shen, Yu Qiao, and Changming Sun. 2018. An end-to-end textspotter with explicit alignment and attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’18). 5020--5029.
    [13]
    Wenhao He, Xu-Yao Zhang, Fei Yin, and Cheng-Lin Liu. 2017. Deep direct regression for multi-oriented scene text detection. In Proceedings of the International Conference on Computer Vision (ICCV’17). 745--753.
    [14]
    Han Hu, Chengquan Zhang, Yuxuan Luo, Yuzhuo Wang, Junyu Han, and Errui Ding. 2017. Wordsup: Exploiting word annotations for character-based text detection. In Proceedings of the International Conference on Computer Vision (ICCV’17). 4940--4949.
    [15]
    Shao Huang, Weiqiang Wang, Shengfeng He, and Rynson W. H. Lau. 2017. Egocentric hand detection via dynamic region growing. ACM Trans. Multimedia Comput., Commun., Applic. 14, 1 (2017), 10.
    [16]
    Weilin Huang, Zhe Lin, Jianchao Yang, and Jue Wang. 2013. Text localization in natural images using stroke feature transform and text covariance descriptors. In Proceedings of the International Conference on Computer Vision (ICCV’13). 1241--1248.
    [17]
    Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia (ACMMM’14). 675--678.
    [18]
    Yingying Jiang, Xiangyu Zhu, Xiaobing Wang, Shuli Yang, Wei Li, Hua Wang, Pei Fu, and Zhenbo Luo. 2017. R2CNN: Rotational region CNN for orientation robust scene text detection. Retrieved from Arxiv Preprint Arxiv:1706.09579 (2017).
    [19]
    Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu et al. 2015. ICDAR 2015 competition on robust reading. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR’15). 1156--1160.
    [20]
    Dimosthenis Karatzas, Faisal Shafait, Seiichi Uchida, Masakazu Iwamura, Lluis Gomez i Bigorda, Sergi Robles Mestre, Joan Mas, David Fernandez Mota, Jon Almazan Almazan, and Lluis Pere De Las Heras. 2013. ICDAR 2013 robust reading competition. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR’13). 1484--1493.
    [21]
    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS’12). 1097--1105.
    [22]
    Dangwei Li, Xiaotang Chen, Zhang Zhang, and Kaiqi Huang. 2017. Learning deep context-aware features over body and latent parts for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17). 384--393.
    [23]
    Xiang Li, Wenhai Wang, Wenbo Hou, Ruo-Ze Liu, Tong Lu, and Jian Yang. 2018. Shape robust text detection with progressive scale expansion network. Retrieved from Arxiv Preprint Arxiv:1806.02559 (2018).
    [24]
    Minghui Liao, Baoguang Shi, Xiang Bai, Xinggang Wang, and Wenyu Liu. 2017. TextBoxes: A fast text detector with a single deep neural network. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI’17). 4161--4167.
    [25]
    Minghui Liao, Zhen Zhu, Baoguang Shi, Gui-song Xia, and Xiang Bai. 2018. Rotation-sensitive regression for oriented scene text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’18). 5909--5918.
    [26]
    Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. 2016. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV’16). 21--37.
    [27]
    Yuliang Liu, Lianwen Jin, Shuaitao Zhang, and Sheng Zhang. 2017. Detecting curve text in the wild: New dataset and new solution. Retrieved from Arxiv Preprint Arxiv:1712.02170 (2017).
    [28]
    Zhandong Liu, Wengang Zhou, and Houqiang Li. 2019. Scene text detection with fully convolutional neural networks. Multimedia Tools Applic. 78, 13 (2019), 18205--18227.
    [29]
    Shangbang Long, Jiaqiang Ruan, Wenjie Zhang, Xin He, Wenhao Wu, and Cong Yao. 2018. Textsnake: A flexible representation for detecting text of arbitrary shapes. In Proceedings of the European Conference on Computer Vision (ECCV’18). 19--35.
    [30]
    Xiang Long, Chuang Gan, Gerard de Melo, Xiao Liu, Yandong Li, Fu Li, and Shilei Wen. 2018. Multimodal keyless attention fusion for video classification. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI’18). 7202--7209.
    [31]
    Pengyuan Lyu, Minghui Liao, Cong Yao, Wenhao Wu, and Xiang Bai. 2018. Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. In Proceedings of the European Conference on Computer Vision (ECCV’18). 67--83.
    [32]
    Pengyuan Lyu, Cong Yao, Wenhao Wu, Shuicheng Yan, and Xiang Bai. 2018. Multi-oriented scene text detection via corner localization and region segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’18). 7553--7563.
    [33]
    Jianqi Ma, Weiyuan Shao, Hao Ye, Li Wang, Hong Wang, Yingbin Zheng, and Xiangyang Xue. 2018. Arbitrary-oriented scene text detection via rotation proposals. IEEE Trans. Multimedia 20, 11 (2018), 3111--3122.
    [34]
    Andrew Mehnert and Paul Jackway. 1997. An improved seeded region growing algorithm. Pattern Recog. Lett. 18, 10 (1997), 1065--1071.
    [35]
    Nibal Nayef, Fei Yin, Imen Bizid, Hyunsoo Choi, Yuan Feng, Dimosthenis Karatzas, Zhenbo Luo, Umapada Pal, Christophe Rigaud, Joseph Chazalon et al. 2017. ICDAR2017 robust reading challenge on multi-lingual scene text detection and script identification-RRC-MLT. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR’17). 1454--1459.
    [36]
    Lukas Neumann and Jiri Matas. 2010. A method for text localization and recognition in real-world images. In Proceedings of the Asian Conference on Computer Vision (ACCV’10). 770--783.
    [37]
    Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS’15). 91--99.
    [38]
    Asif Shahab, Faisal Shafait, and Andreas Dengel. 2011. ICDAR 2011 robust reading competition challenge 2: Reading text in scene images. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR’11). 1491--1496.
    [39]
    Baoguang Shi, Xiang Bai, and Serge Belongie. 2017. Detecting oriented text in natural images by linking segments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17). 2550--2558.
    [40]
    Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. Retrieved from Arxiv Preprint Arxiv:1409.1556 (2014).
    [41]
    Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’15). 1--9.
    [42]
    Zhi Tian, Weilin Huang, Tong He, Pan He, and Yu Qiao. 2016. Detecting text in natural image with connectionist text proposal network. In Proceedings of the European Conference on Computer Vision (ECCV’16). 56--72.
    [43]
    Cheng Wang, Haojin Yang, and Christoph Meinel. 2018. Image captioning with deep bidirectional LSTMs and multi-task learning. ACM Trans. Multimedia Comput., Commun., Applic. 14, 2s (2018), 40.
    [44]
    Christian Wolf and Jean-Michel Jolion. 2006. Object count/area graphs for the evaluation of object detection and segmentation algorithms. Int. J. Doc. Anal. Recog. 8, 4 (2006), 280--296.
    [45]
    Saining Xie and Zhuowen Tu. 2015. Holistically nested edge detection. In Proceedings of the International Conference on Computer Vision (ICCV’15). 1395--1403.
    [46]
    Wei Yang, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang. 2016. End-to-end learning of deformable mixture of parts and deep convolutional neural networks for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16). 3073--3082.
    [47]
    Cong Yao, Xiang Bai, Wenyu Liu, Yi Ma, and Zhuowen Tu. 2012. Detecting texts of arbitrary orientations in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’12). 1083--1090.
    [48]
    Cong Yao, Xiang Bai, Nong Sang, Xinyu Zhou, Shuchang Zhou, and Zhimin Cao. 2016. Scene text detection via holistic, multi-channel prediction. Retrieved from Arxiv Preprint Arxiv:1606.09002 (2016).
    [49]
    Xu-Cheng Yin, Xuwang Yin, Kaizhu Huang, and Hong-Wei Hao. 2014. Robust text detection in natural scene images. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 36, 5 (2014), 970--983.
    [50]
    Xu-Cheng Yin, Ze-Yu Zuo, Shu Tian, and Cheng-Lin Liu. 2016. Text detection, tracking and recognition in video: a comprehensive survey. IEEE Transactions on Image Processing (TIP) 25, 6 (2016), 2752--2773.
    [51]
    Lu Zhang, Ju Dai, Huchuan Lu, You He, and Gang Wang. 2018. A bi-directional message passing model for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’18). 1741--1750.
    [52]
    Zheng Zhang, Chengquan Zhang, Wei Shen, Cong Yao, Wenyu Liu, and Xiang Bai. 2016. Multi-oriented text detection with fully convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16). 4159--4167.
    [53]
    Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. 2017. EAST: An efficient and accurate scene text detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17). 2642--2651.
    [54]
    Yingying Zhu, Cong Yao, and Xiang Bai. 2016. Scene text detection and recognition: Recent advances and future trends. Front. Comput. Sci. (FCS) 10, 1 (2016), 19--36.

    Cited By

    View all
    • (2024)Multimodal Visual-Semantic Representations Learning for Scene Text RecognitionACM Transactions on Multimedia Computing, Communications, and Applications10.1145/3646551Online publication date: 19-Feb-2024
    • (2024)Buffer-text: Detecting arbitrary shaped text in natural scene imageEngineering Applications of Artificial Intelligence10.1016/j.engappai.2023.107774130(107774)Online publication date: Apr-2024
    • (2024)Person Reidentification using 3D inception based Spatio-temporal features learning, attribute recognition, and RerankingMultimedia Tools and Applications10.1007/s11042-023-15473-z83:1(2007-2030)Online publication date: 1-Jan-2024
    • Show More Cited By

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Transactions on Multimedia Computing, Communications, and Applications
    ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 15, Issue 4
    November 2019
    322 pages
    ISSN:1551-6857
    EISSN:1551-6865
    DOI:10.1145/3376119
    Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 16 December 2019
    Accepted: 01 August 2019
    Revised: 01 August 2019
    Received: 01 December 2018
    Published in TOMM Volume 15, Issue 4

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. Scene text detection
    2. attention
    3. bidirectional LSTM
    4. feature fusion
    5. semantic segmentation

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)77
    • Downloads (Last 6 weeks)14

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Multimodal Visual-Semantic Representations Learning for Scene Text RecognitionACM Transactions on Multimedia Computing, Communications, and Applications10.1145/3646551Online publication date: 19-Feb-2024
    • (2024)Buffer-text: Detecting arbitrary shaped text in natural scene imageEngineering Applications of Artificial Intelligence10.1016/j.engappai.2023.107774130(107774)Online publication date: Apr-2024
    • (2024)Person Reidentification using 3D inception based Spatio-temporal features learning, attribute recognition, and RerankingMultimedia Tools and Applications10.1007/s11042-023-15473-z83:1(2007-2030)Online publication date: 1-Jan-2024
    • (2024)DC-PSENet: a novel scene text detection method integrating double ResNet-based and changed channels recursive feature pyramidThe Visual Computer: International Journal of Computer Graphics10.1007/s00371-023-03093-540:6(4473-4491)Online publication date: 1-Jun-2024
    • (2023)Multi-level Attention-based Domain Disentanglement for BCDRACM Transactions on Information Systems10.1145/357692541:4(1-24)Online publication date: 23-Mar-2023
    • (2023)Context Sensing Attention Network for Video-based Person Re-identificationACM Transactions on Multimedia Computing, Communications, and Applications10.1145/357320319:4(1-20)Online publication date: 27-Feb-2023
    • (2023)Perceptual Hashing of Deep Convolutional Neural Networks for Model Copy DetectionACM Transactions on Multimedia Computing, Communications, and Applications10.1145/357277719:3(1-20)Online publication date: 2-Mar-2023
    • (2023)Referring Expression Comprehension Via Enhanced Cross-modal Graph Attention NetworksACM Transactions on Multimedia Computing, Communications, and Applications10.1145/354868819:2(1-21)Online publication date: 6-Feb-2023
    • (2023)Learning Pixel Affinity Pyramid for Arbitrary-Shaped Text DetectionACM Transactions on Multimedia Computing, Communications, and Applications10.1145/352461719:1s(1-24)Online publication date: 3-Feb-2023
    • (2023)Robust Image Hashing With Isomap and Saliency Map for Copy DetectionIEEE Transactions on Multimedia10.1109/TMM.2021.313921725(1085-1097)Online publication date: 1-Jan-2023
    • Show More Cited By

    View Options

    Get Access

    Login options

    Full Access

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    HTML Format

    View this article in HTML Format.

    HTML Format

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media