FACLSTM: ConvLSTM with focused attention for scene text recognition

Wang, Qingqing; Huang, Ye; Jia, Wenjing; He, Xiangjian; Blumenstein, Michael; Lyu, Shujing; Lu, Yue

doi:10.1007/s11432-019-2713-1

Research Paper
Special Focus on Deep Learning for Computer Vision
Published: 15 January 2020

Volume 63, article number 120103, (2020)

Science China Information Sciences

Qingqing Wang¹,²,
Ye Huang²,
Wenjing Jia²,
Xiangjian He²,
Michael Blumenstein²,
Shujing Lyu¹ &
Yue Lu¹,³


Abstract

Scene text recognition has recently been widely treated as a sequence-to-sequence prediction problem, in which the traditional fully-connected LSTM (FC-LSTM) has played a critical role. Owing to the limitations of FC-LSTM, existing methods have to convert 2-D feature maps into 1-D sequential feature vectors, which severely damages the valuable spatial and structural information of text images. In this paper, we argue that scene text recognition is essentially a spatiotemporal prediction problem because of its 2-D image inputs, and propose a convolutional LSTM (ConvLSTM)-based scene text recognizer, namely FACLSTM (focused attention ConvLSTM), in which the spatial correlation of pixels is fully leveraged when performing sequential prediction with LSTM. In particular, the attention mechanism is incorporated into an efficient ConvLSTM structure via convolutional operations, and additional character center masks are generated to help focus attention on the right feature areas. Experimental results on the benchmark datasets IIIT5K, SVT and CUTE demonstrate that the proposed FACLSTM performs competitively on regular, low-resolution and noisy text images, and outperforms state-of-the-art approaches on curved text images by large margins.
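The contrast the abstract draws between FC-LSTM and ConvLSTM can be illustrated with a minimal NumPy sketch. It is not the FACLSTM architecture: it only shows a generic ConvLSTM step whose hidden and cell states stay 2-D, combined with a simple convolutional attention map over spatial locations. All function names, tensor sizes, and the way attention is combined with the features are assumptions made for illustration, and the character center masks mentioned in the abstract are not modeled.

```python
import numpy as np

def conv2d_same(x, w):
    """Naive 'same' cross-correlation. x: (C_in, H, W); w: (C_out, C_in, k, k), k odd."""
    c_out, c_in, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    _, h, wd = x.shape
    out = np.zeros((c_out, h, wd))
    for i in range(k):
        for j in range(k):
            # Contract the input-channel axis for kernel offset (i, j).
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], xp[:, i:i + h, j:j + wd])
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x, h, c, w, b):
    """One ConvLSTM step. x: (C_x, H, W); h, c: (C_h, H, W). States remain 2-D maps."""
    gates = conv2d_same(np.concatenate([x, h], axis=0), w) + b[:, None, None]
    i, f, o, g = np.split(gates, 4, axis=0)  # input, forget, output gates and candidate
    c_next = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_next = sigmoid(o) * np.tanh(c_next)
    return h_next, c_next

def spatial_attention(features, h, w_att):
    """Hypothetical convolutional attention: one score per location, softmax over H*W."""
    scores = conv2d_same(np.concatenate([features, h], axis=0), w_att)[0]
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return features * weights[None], weights

# Illustrative sizes only (assumed, not taken from the paper).
rng = np.random.default_rng(0)
C_x, C_h, H, W, k = 4, 8, 6, 20, 3
features = rng.standard_normal((C_x, H, W))            # stand-in for CNN feature maps
w = rng.standard_normal((4 * C_h, C_x + C_h, k, k)) * 0.1
b = np.zeros(4 * C_h)
w_att = rng.standard_normal((1, C_x + C_h, k, k)) * 0.1
h = np.zeros((C_h, H, W))
c = np.zeros((C_h, H, W))

for _ in range(3):                                      # three decoding steps
    attended, weights = spatial_attention(features, h, w_att)
    h, c = convlstm_step(attended, h, c, w, b)
```

An FC-LSTM would instead require `features` to be flattened into a sequence of 1-D vectors before the recurrence, which is the loss of spatial structure that the abstract identifies as the limitation of existing methods.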



Acknowledgements

This work was supported by the China Scholarship Council (Grant No. 201706140138), the Shanghai Natural Science Foundation (Grant No. 19ZR1415900), and the Shanghai Knowledge Service Platform Project (Grant No. ZF1213).

Author information

Authors and Affiliations

Shanghai Key Laboratory of Multidimensional Information Processing, East China Normal University, Shanghai, 200241, China
Qingqing Wang, Shujing Lyu & Yue Lu
Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, 2007, Australia
Qingqing Wang, Ye Huang, Wenjing Jia, Xiangjian He & Michael Blumenstein
Shanghai Institute of Intelligent Science and Technology, Tongji University, Shanghai, 200092, China
Yue Lu

Corresponding author

Correspondence to Yue Lu.

About this article

Cite this article

Wang, Q., Huang, Y., Jia, W. et al. FACLSTM: ConvLSTM with focused attention for scene text recognition. Sci. China Inf. Sci. 63, 120103 (2020). https://doi.org/10.1007/s11432-019-2713-1

Received: 30 July 2019
Revised: 08 October 2019
Accepted: 12 November 2019
Published: 15 January 2020
DOI: https://doi.org/10.1007/s11432-019-2713-1
