Abstract
Scene text recognition has recently been widely treated as a sequence-to-sequence prediction problem, in which the traditional fully-connected LSTM (FC-LSTM) has played a critical role. Owing to the limitations of FC-LSTM, existing methods have to convert 2-D feature maps into 1-D sequential feature vectors, which severely damages the valuable spatial and structural information of text images. In this paper, we argue that scene text recognition is essentially a spatiotemporal prediction problem because its inputs are 2-D images, and we propose a convolutional LSTM (ConvLSTM)-based scene text recognizer, namely FACLSTM (focused attention ConvLSTM), in which the spatial correlation of pixels is fully leveraged when performing sequential prediction with LSTM. In particular, the attention mechanism is incorporated into an efficient ConvLSTM structure via convolutional operations, and additional character center masks are generated to help focus attention on the right feature areas. Experimental results on the benchmark datasets IIIT5K, SVT and CUTE demonstrate that the proposed FACLSTM performs competitively on regular, low-resolution and noisy text images, and outperforms state-of-the-art approaches on curved text images by large margins.
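The contrast drawn above between FC-LSTM and ConvLSTM can be illustrated with a minimal NumPy sketch. It is not the FACLSTM model: it omits the attention mechanism and the character center masks, uses a plain ConvLSTM step without peephole connections, and all names (`conv2d_same`, `convlstm_step`), shapes and weight scales are assumptions chosen for illustration only.

```python
import numpy as np

def conv2d_same(x, w):
    """Stride-1, zero-padded 'same' 2-D cross-correlation.
    x: (C_in, H, W); w: (C_out, C_in, k, k) with odd k. Returns (C_out, H, W)."""
    c_out, c_in, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))  # zero padding on H and W
    _, h, wd = x.shape
    out = np.zeros((c_out, h, wd))
    for i in range(k):
        for j in range(k):
            # Accumulate the contribution of kernel tap (i, j) at every position.
            patch = xp[:, i:i + h, j:j + wd]
            out += np.einsum('oc,chw->ohw', w[:, :, i, j], patch)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x, h_prev, c_prev, w_x, w_h, b):
    """One ConvLSTM step (illustrative, no peephole terms).
    The gates come from convolutions, so h and c stay (C_hid, H, W) maps."""
    gates = conv2d_same(x, w_x) + conv2d_same(h_prev, w_h) + b[:, None, None]
    i, f, o, g = np.split(gates, 4, axis=0)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
c_in, c_hid, height, width, k = 8, 4, 6, 20, 3   # hypothetical sizes
feat = rng.standard_normal((c_in, height, width))  # stand-in for a CNN feature map

# FC-LSTM-style input: each image column becomes one flat vector,
# so the vertical layout inside a column is no longer explicit.
seq = feat.transpose(2, 0, 1).reshape(width, c_in * height)

# ConvLSTM-style input: the 2-D feature map is consumed as is.
w_x = rng.standard_normal((4 * c_hid, c_in, k, k)) * 0.1
w_h = rng.standard_normal((4 * c_hid, c_hid, k, k)) * 0.1
b = np.zeros(4 * c_hid)
h0 = np.zeros((c_hid, height, width))
c0 = np.zeros((c_hid, height, width))
h1, c1 = convlstm_step(feat, h0, c0, w_x, w_h, b)

print(seq.shape)  # 1-D sequence of flattened columns
print(h1.shape)   # hidden state that keeps the H x W layout
```

In this sketch the flattened sequence has one vector per image column, whereas the ConvLSTM hidden state retains the height and width of the feature map, which is the property the proposed recognizer relies on.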
Acknowledgements
This work was supported by China Scholarship Council (Grant No. 201706140138), Shanghai Natural Science Foundation (Grant No. 19ZR1415900), and Shanghai Knowledge Service Platform Project (Grant No. ZF1213).
Cite this article
Wang, Q., Huang, Y., Jia, W. et al. FACLSTM: ConvLSTM with focused attention for scene text recognition. Sci. China Inf. Sci. 63, 120103 (2020). https://doi.org/10.1007/s11432-019-2713-1
DOI: https://doi.org/10.1007/s11432-019-2713-1