research-article

AB-LSTM: Attention-based Bidirectional LSTM Model for Scene Text Detection

Authors:

Wengang Zhou, and

Houqiang Li

ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Volume 15, Issue 4

Article No.: 107, Pages 1 - 23

https://doi.org/10.1145/3356728

Published: 16 December 2019

Abstract

Detecting scene text of arbitrary shape is a challenging task in computer vision. Most existing scene text detection methods use rectangular or quadrangular bounding boxes to denote the detected text, which cannot accurately fit text with arbitrary shapes, such as curved text. In addition, recent progress in scene text detection has benefited from Fully Convolutional Networks. The text cues contained in multi-level convolutional features are complementary for detecting scene text objects, but how to exploit these multi-level features is still an open problem. To tackle these issues, we propose an Attention-based Bidirectional Long Short-Term Memory (AB-LSTM) model for scene text detection. First, word stroke regions (WSRs) and text center blocks (TCBs) are extracted by two AB-LSTM models, respectively. Then, the union of WSRs and TCBs is used to represent text objects. To verify the effectiveness of the proposed method, we perform experiments on four public benchmarks (CTW1500, Total-text, ICDAR2013, and MSRA-TD500) and compare it with existing state-of-the-art methods. Experimental results demonstrate that the proposed method achieves competitive results and handles scene text objects with arbitrary shapes (i.e., curved, oriented, and horizontal forms) well.
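
The last step of the pipeline, representing a text object by the union of WSRs and TCBs, can be illustrated with a small sketch. It assumes the two AB-LSTM models have already produced binary maps of the same size, and it groups the merged foreground into regions with a plain 4-connected flood fill. The function names `union_mask` and `connected_regions`, the toy inputs, and the flood-fill grouping are illustrative assumptions; the abstract does not say how the union is turned into individual text instances.

```python
from collections import deque


def union_mask(wsr_mask, tcb_mask):
    """Pixel-wise union of two binary maps given as equal-sized lists of rows."""
    if len(wsr_mask) != len(tcb_mask) or any(
        len(a) != len(b) for a, b in zip(wsr_mask, tcb_mask)
    ):
        raise ValueError("masks must have the same shape")
    return [
        [bool(a) or bool(b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(wsr_mask, tcb_mask)
    ]


def connected_regions(mask):
    """Group foreground pixels into 4-connected regions.

    Assumption: this grouping is a stand-in for post-processing,
    not the procedure used in the paper.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    regions = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (
                        0 <= ny < height
                        and 0 <= nx < width
                        and mask[ny][nx]
                        and not seen[ny][nx]
                    ):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(pixels)
    return regions


# Toy 3x5 maps standing in for the outputs of the two models (hypothetical data).
wsr = [[1, 1, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 1, 0]]
tcb = [[0, 1, 1, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 1, 1]]

text_mask = union_mask(wsr, tcb)
regions = connected_regions(text_mask)
```

On the toy maps, each region in `regions` is a list of `(row, column)` pixels covered by either map. Because the regions are arbitrary pixel sets rather than boxes, this kind of representation can follow curved text where a rectangle would not.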

Index Terms

AB-LSTM: Attention-based Bidirectional LSTM Model for Scene Text Detection
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Image segmentation
        Interest point and salient region detections
      2. Computer vision tasks
        Scene understanding

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications Volume 15, Issue 4

November 2019

322 pages

ISSN:1551-6857

EISSN:1551-6865

DOI:10.1145/3376119

Editor:
Alberto Del Bimbo
University of Firenze, Italy

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 December 2019

Accepted: 01 August 2019

Revised: 01 August 2019

Received: 01 December 2018

Qualifiers

Research-article
Research
Refereed

Funding Sources

Youth Innovation Promotion Association of the Chinese Academy of Sciences
National Natural Science Foundation of China

Cited By

Gao X, Pang Y, Liu Y, Han M, Yu J, Wang W, Chen Y. (2024). Multimodal Visual-Semantic Representations Learning for Scene Text Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications. DOI: 10.1145/3646551. Online publication date: 19-Feb-2024.
https://doi.org/10.1145/3646551
Yang K, Yi J, Chen A, Jin Z. (2024). Buffer-text: Detecting arbitrary shaped text in natural scene image. Engineering Applications of Artificial Intelligence 130 (107774). DOI: 10.1016/j.engappai.2023.107774. Online publication date: Apr-2024.
https://doi.org/10.1016/j.engappai.2023.107774
Choudhary M, Tiwari V, Jain S, Rajpoot V. (2024). Person Reidentification using 3D inception based Spatio-temporal features learning, attribute recognition, and Reranking. Multimedia Tools and Applications 83:1 (2007-2030). DOI: 10.1007/s11042-023-15473-z. Online publication date: 1-Jan-2024.
https://dl.acm.org/doi/10.1007/s11042-023-15473-z
Huang L, Liao S, Yang W. (2024). DC-PSENet: a novel scene text detection method integrating double ResNet-based and changed channels recursive feature pyramid. The Visual Computer: International Journal of Computer Graphics 40:6 (4473-4491). DOI: 10.1007/s00371-023-03093-5. Online publication date: 1-Jun-2024.
https://dl.acm.org/doi/10.1007/s00371-023-03093-5
Zhang X, Li J, Su H, Zhu L, Shen H. (2023). Multi-level Attention-based Domain Disentanglement for BCDR. ACM Transactions on Information Systems 41:4 (1-24). DOI: 10.1145/3576925. Online publication date: 23-Mar-2023.
https://dl.acm.org/doi/10.1145/3576925
Wang K, Ding C, Pang J, Xu X. (2023). Context Sensing Attention Network for Video-based Person Re-identification. ACM Transactions on Multimedia Computing, Communications, and Applications 19:4 (1-20). DOI: 10.1145/3573203. Online publication date: 27-Feb-2023.
https://dl.acm.org/doi/10.1145/3573203
Chen H, Zhou H, Zhang J, Chen D, Zhang W, Chen K, Hua G, Yu N. (2023). Perceptual Hashing of Deep Convolutional Neural Networks for Model Copy Detection. ACM Transactions on Multimedia Computing, Communications, and Applications 19:3 (1-20). DOI: 10.1145/3572777. Online publication date: 2-Mar-2023.
https://dl.acm.org/doi/10.1145/3572777
Wang J, Ke J, Shuai H, Li Y, Cheng W. (2023). Referring Expression Comprehension Via Enhanced Cross-modal Graph Attention Networks. ACM Transactions on Multimedia Computing, Communications, and Applications 19:2 (1-21). DOI: 10.1145/3548688. Online publication date: 6-Feb-2023.
https://dl.acm.org/doi/10.1145/3548688
Fu Z, Xie H, Fang S, Wang Y, Xing M, Zhang Y. (2023). Learning Pixel Affinity Pyramid for Arbitrary-Shaped Text Detection. ACM Transactions on Multimedia Computing, Communications, and Applications 19:1s (1-24). DOI: 10.1145/3524617. Online publication date: 3-Feb-2023.
https://dl.acm.org/doi/10.1145/3524617
Liang X, Tang Z, Wu J, Li Z, Zhang X. (2023). Robust Image Hashing With Isomap and Saliency Map for Copy Detection. IEEE Transactions on Multimedia 25 (1085-1097). DOI: 10.1109/TMM.2021.3139217. Online publication date: 1-Jan-2023.
https://dl.acm.org/doi/10.1109/TMM.2021.3139217
