Scene Text Recognition via Dual-path Network with Shape-driven Attention Alignment

Published: 11 January 2024
    Abstract

    Scene text recognition (STR), a typical sequence-to-sequence problem, has recently drawn much attention in multimedia applications. To perform well, an STR model must obtain aligned character-wise features from the whole-image feature maps. Most existing works adopt fully data-driven attention-based alignment, a practice that ignores character-specific geometric information. In this article, we propose a novel shape-driven attention alignment method that builds on a group of learnable geometric points to obtain character-wise features. Concretely, we first design a corner detector that generates a shape map to guide the attention alignment explicitly, so that a series of points can be learned to represent character-wise features flexibly. We then propose a dual-path network with a mutual learning and cooperating strategy that combines a CNN with a ViT-based model, leading to further accuracy improvement. We conduct extensive experiments to evaluate the proposed method on various scene text benchmarks, including six popular regular and irregular datasets, two more challenging datasets (i.e., WordArt and OST), and three Chinese datasets. Experimental results indicate that our method achieves superior performance against many state-of-the-art models at a comparable model size.
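The abstract does not give the exact form of the mutual learning strategy between the two paths. As a minimal sketch, assuming a standard deep-mutual-learning style objective, each path could be trained with a cross-entropy term on the ground-truth characters plus a KL term that pulls its prediction towards the other path's prediction. The function names, the `weight` parameter, and the per-position logit layout below are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Negative log-likelihood of the target class index."""
    return -math.log(probs[target])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_learning_losses(cnn_logits, vit_logits, targets, weight=1.0):
    """Return (cnn_loss, vit_loss), averaged over character positions.

    cnn_logits, vit_logits: one list of class logits per character position
        (hypothetical outputs of the CNN path and the ViT-based path).
    targets: ground-truth class index per character position.
    weight: assumed balance between the supervised and the mimicry term.
    """
    cnn_loss = 0.0
    vit_loss = 0.0
    for c, v, y in zip(cnn_logits, vit_logits, targets):
        p_cnn, p_vit = softmax(c), softmax(v)
        # Each path fits the label and also mimics the other path.
        cnn_loss += cross_entropy(p_cnn, y) + weight * kl_divergence(p_vit, p_cnn)
        vit_loss += cross_entropy(p_vit, y) + weight * kl_divergence(p_cnn, p_vit)
    n = len(targets)
    return cnn_loss / n, vit_loss / n
```

When both paths output the same distribution, the KL terms vanish and each loss reduces to plain cross-entropy; the mimicry term only contributes where the two paths disagree.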

      Published In

      ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 4
      April 2024
      676 pages
      ISSN:1551-6857
      EISSN:1551-6865
      DOI:10.1145/3613617
      Editor: Abdulmotaleb El Saddik

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 11 January 2024
      Online AM: 21 November 2023
      Accepted: 08 November 2023
      Revised: 08 October 2023
      Received: 14 July 2023
      Published in TOMM Volume 20, Issue 4

      Author Tags

      1. OCR
      2. scene text recognition
      3. deformable attention
      4. attention alignment
      5. dual path network

      Qualifiers

      • Research-article

      Funding Sources

      • National Natural Science Foundation of China
      • Jiangsu Science and Technology Programme (Natural Science Foundation of Jiangsu Province)
      • European Union’s Horizon 2020 research and innovation programme
      • UK EPSRC under projects

      Article Metrics

      • Total Citations: 0
      • Total Downloads: 285
      • Downloads (Last 12 months): 285
      • Downloads (Last 6 weeks): 43
      Reflects downloads up to 26 Jul 2024
