research-article

Scene Text Recognition via Dual-path Network with Shape-driven Attention Alignment

Authors:

Qiu-Feng Wang

ACM Transactions on Multimedia Computing, Communications and Applications, Volume 20, Issue 4

Article No.: 107, Pages 1 - 20

https://doi.org/10.1145/3633517

Published: 11 January 2024

Abstract

Scene text recognition (STR), a typical sequence-to-sequence problem, has recently drawn much attention in multimedia applications. To perform well, an STR model must obtain aligned character-wise features from the whole-image feature maps. Most existing works adopt fully data-driven attention-based alignment, which ignores character-specific geometric information. In this article, we propose a novel shape-driven attention alignment method, built upon a group of learnable geometric points, to obtain character-wise features. Concretely, we first design a corner detector that generates a shape map to guide the attention alignment explicitly, where a series of points can be learned to represent character-wise features flexibly. We then propose a dual-path network with a mutual learning and cooperating strategy that combines a CNN with a ViT-based model, leading to further accuracy improvement. We conduct extensive experiments on various scene text benchmarks, including six popular regular and irregular datasets, two more challenging datasets (i.e., WordArt and OST), and three Chinese datasets. Experimental results indicate that our method achieves superior performance with a comparable model size against many state-of-the-art models.
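To make "aligned character-wise features" concrete, the following is a minimal sketch of generic attention-based alignment: each character slot holds one attention map over the positions of the feature map, and its feature vector is the attention-weighted sum of the position vectors. This illustrates the data-driven baseline that the abstract contrasts against, not the article's shape-driven method, whose formulation is not given here. The function names, the flattened-map layout, and the toy numbers are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def align_character_features(feature_map, attention_logits):
    """Pool one feature vector per character from a flattened feature map.

    feature_map:      list of N position vectors, each of length C
                      (an H x W map flattened to N = H * W positions).
    attention_logits: list of T score lists, each of length N
                      (one unnormalised attention map per character slot).
    Returns a list of T vectors of length C.
    """
    channels = len(feature_map[0])
    aligned = []
    for logits in attention_logits:
        weights = softmax(logits)  # weights over positions sum to 1
        pooled = [
            sum(w * pos[c] for w, pos in zip(weights, feature_map))
            for c in range(channels)
        ]
        aligned.append(pooled)
    return aligned

# Toy input (hypothetical values): 4 positions, 2 channels, 2 character slots.
features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
logits = [
    [0.0, 0.0, 0.0, 0.0],   # uniform attention: plain average of positions
    [50.0, 0.0, 0.0, 0.0],  # attention concentrated on position 0
]
char_features = align_character_features(features, logits)
```

With uniform logits the pooled vector is simply the mean of all positions, which shows why poorly localised attention blurs neighbouring characters together. A sharply peaked map recovers the feature at a single position instead.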

Index Terms

Computing methodologies
  Artificial intelligence
    Computer vision
      Computer vision tasks
        Scene understanding

Information

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications Volume 20, Issue 4

April 2024

676 pages

ISSN:1551-6857

EISSN:1551-6865

DOI:10.1145/3613617

Editor:
Abdulmotaleb El Saddik
Mohamed Bin Zayed University of Artificial Intelligence, UAE and University of Ottawa, Canada

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 January 2024

Online AM: 21 November 2023

Accepted: 08 November 2023

Revised: 08 October 2023

Received: 14 July 2023

Published in TOMM Volume 20, Issue 4

Funding Sources

National Natural Science Foundation of China
Jiangsu Science and Technology Programme (Natural Science Foundation of Jiangsu Province)
European Union’s Horizon 2020 research and innovation programme
UK EPSRC under projects
