Abstract
The encoder-decoder framework has recently become the dominant approach to scene text recognition (STR), where most decoder networks consist of two parts: an attention model that aligns visual features from the encoder for each character, and a linear or LSTM-based model that predicts the label sequence. However, these attention models struggle to obtain accurate alignment, and linear or LSTM-based models usually capture only limited context. To emphasize the role of character feature alignment, we separate the attention alignment module from the decoder network in this work, forming an Encoder-Alignment-Decoder framework. Under this framework, we propose a deformable-attention-based model to accurately align the visual features of each character. In this alignment model, we explicitly learn the spatial coordinates of each character from the input reading-order sequence and refine them with learnable sampling offsets in the attention block to obtain accurately aligned features. To address the lack of context, we explore a transformer-based decoder that captures global context through multi-head attention, where a mask matrix keeps the attention weights focused on the relevant context during decoding. Extensive experiments demonstrate the effectiveness of the Encoder-Alignment-Decoder framework in STR: it outperforms other language-free methods by a significant margin on most STR benchmarks and achieves state-of-the-art performance on several datasets when a language model is integrated.
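To make the two components concrete, the sketches below illustrate (in PyTorch) one plausible reading of the abstract. They are not the authors' implementation, and all module and parameter names (DeformableCharAlignment, n_points, max_chars, and so on) are hypothetical. The first sketch assumes a Deformable-DETR-style formulation: each character query, derived from its reading-order index, predicts a 2D reference point on the encoder's feature map plus small learnable sampling offsets, and the aligned character feature is an attention-weighted sum of features sampled at those locations.

```python
# A minimal sketch (not the paper's code) of deformable-attention character
# alignment. Assumes the encoder feature channels equal d_model and that
# offsets follow a Deformable-DETR-style single-scale, single-head design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableCharAlignment(nn.Module):
    def __init__(self, d_model=256, n_points=4, max_chars=25):
        super().__init__()
        # Embedding of the reading-order index (1st, 2nd, ... character).
        self.order_embed = nn.Embedding(max_chars, d_model)
        # Predict a reference (x, y) in [0, 1]^2 for each character query.
        self.ref_head = nn.Linear(d_model, 2)
        # Predict per-point sampling offsets and attention weights.
        self.offset_head = nn.Linear(d_model, n_points * 2)
        self.weight_head = nn.Linear(d_model, n_points)
        self.n_points = n_points

    def forward(self, feat):
        """feat: (B, C, H, W) visual features from the encoder."""
        B, C, H, W = feat.shape
        T = self.order_embed.num_embeddings
        q = self.order_embed.weight.unsqueeze(0).expand(B, T, C)  # (B, T, C)

        ref = self.ref_head(q).sigmoid()                    # (B, T, 2) in [0, 1]
        off = self.offset_head(q).view(B, T, self.n_points, 2)
        w = self.weight_head(q).softmax(-1)                 # (B, T, P)

        # Sampling locations = reference point + learnable offsets,
        # mapped to grid_sample's [-1, 1] coordinate convention.
        loc = (ref.unsqueeze(2) + off).clamp(0, 1) * 2 - 1  # (B, T, P, 2)
        sampled = F.grid_sample(feat, loc, align_corners=False)  # (B, C, T, P)

        # Attention-weighted sum over the P sampled points per character.
        aligned = (sampled * w.unsqueeze(1)).sum(-1)        # (B, C, T)
        return aligned.transpose(1, 2)                      # (B, T, C)

# Usage: one aligned feature vector per character slot.
align = DeformableCharAlignment(d_model=256)
feat = torch.randn(2, 256, 8, 32)   # e.g. encoder output for a text image
aligned = align(feat)               # (2, 25, 256)
```

The second sketch shows masked multi-head self-attention in a transformer decoder, assuming the mask matrix the abstract mentions is a standard causal (upper-triangular) mask that keeps each position's attention weights on previously decoded context.

```python
# A minimal sketch of masked self-attention; True entries in the boolean
# mask mark positions that may NOT be attended to (future characters).
import torch
import torch.nn as nn

d_model, n_heads, T = 256, 8, 25
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
x = torch.randn(2, T, d_model)                   # (B, T, C) decoder input
causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
out, weights = attn(x, x, x, attn_mask=causal)   # masked self-attention
```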
Acknowledgements
This work was funded by the National Natural Science Foundation of China under grants 61876154 and 61876155, by the Jiangsu Science and Technology Programme under grant BE2020006-4, and by the "Qing Lan Project" of Jiangsu universities.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Hu, Y., Dong, B., Wang, Q., Ding, L., Jin, X., Huang, K. (2023). Towards Accurate Alignment and Sufficient Context in Scene Text Recognition. In: Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A. (eds) Neural Information Processing. ICONIP 2022. Lecture Notes in Computer Science, vol 13625. Springer, Cham. https://doi.org/10.1007/978-3-031-30111-7_59
DOI: https://doi.org/10.1007/978-3-031-30111-7_59
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-30110-0
Online ISBN: 978-3-031-30111-7
eBook Packages: Computer Science, Computer Science (R0)