DOI: 10.1145/3581783.3611791

Fine-grained Pseudo Labels for Scene Text Recognition

Published: 27 October 2023

    Abstract

    Pseudo-labeling-based semi-supervised learning has shown promising results in Scene Text Recognition (STR). Most existing methods use a pre-trained model to generate sequence-level pseudo labels for text images and then re-train the model. Recently, pseudo-labeling within a teacher-student framework, in which a student model is supervised by pseudo labels from a teacher model, has become increasingly popular because it trains end-to-end and yields outstanding performance in semi-supervised learning. However, applying this framework directly to pseudo-labeling for STR leads to unstable convergence, because generating pseudo labels at the coarse-grained sequence level uses unlabeled data inefficiently. Furthermore, the inherent domain shift between labeled and unlabeled data lowers the quality of the derived pseudo labels. To mitigate these issues, we propose a novel Cross-domain Pseudo-Labeling (CPL) approach for scene text recognition, which makes better use of unlabeled data at the character level and provides more accurate pseudo labels. Specifically, our proposed Pseudo-Labeled Curriculum Learning dynamically adjusts the thresholds for different character classes according to the model's learning status. Moreover, an Adaptive Distribution Regularizer is employed to bridge the domain gap and improve the quality of pseudo labels. Extensive experiments show that CPL boosts representative STR models to state-of-the-art results on six challenging STR benchmarks. It also generalizes effectively to handwritten text.
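
    The contrast between sequence-level and character-level pseudo labels can be illustrated with a small sketch. The abstract gives no formulas, so everything below is an assumption made for illustration: the threshold rule (a base threshold scaled by each class's share of confident predictions, in the spirit of curriculum pseudo labeling), the function names, and the toy numbers are all hypothetical and should not be read as the paper's actual formulation.

```python
# Hypothetical sketch: character-level pseudo-label selection with
# class-wise dynamic thresholds. Not the paper's exact formulation.

def class_thresholds(confident_counts, num_classes, base_threshold=0.9):
    """Scale a base threshold per character class by an assumed learning status.

    confident_counts[c] is how many confident predictions class c has received
    so far; well-learned classes keep a high threshold, others get a lower one.
    """
    counts = [confident_counts.get(c, 0) for c in range(num_classes)]
    max_count = max(counts)
    if max_count == 0:
        # No statistics yet: fall back to the fixed base threshold.
        return [base_threshold] * num_classes
    return [base_threshold * count / max_count for count in counts]


def select_characters(teacher_probs, thresholds):
    """Return per-position pseudo labels and a keep-mask for the student loss."""
    labels, mask = [], []
    for probs in teacher_probs:
        label = max(range(len(probs)), key=lambda c: probs[c])
        labels.append(label)
        mask.append(probs[label] >= thresholds[label])
    return labels, mask


def sequence_level_keep(teacher_probs, threshold):
    """Coarse baseline: keep the whole word only if its joint confidence is high."""
    confidence = 1.0
    for probs in teacher_probs:
        confidence *= max(probs)
    return confidence >= threshold


# Toy teacher output for a 3-character word over a 3-class alphabet.
teacher_probs = [
    [0.95, 0.03, 0.02],
    [0.10, 0.85, 0.05],
    [0.30, 0.30, 0.40],
]
thresholds = class_thresholds({0: 100, 1: 90, 2: 50}, num_classes=3)
labels, mask = select_characters(teacher_probs, thresholds)
print(labels, mask)                              # [0, 1, 2] [True, True, False]
print(sequence_level_keep(teacher_probs, 0.9))   # False
```

    In this toy case the sequence-level filter discards the whole word, while the character-level mask still lets two of the three positions supervise the student. This is one way to read the abstract's claim that sequence-level labels use unlabeled data inefficiently.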

      Published In

      MM '23: Proceedings of the 31st ACM International Conference on Multimedia
      October 2023
      9913 pages
      ISBN: 9798400701085
      DOI: 10.1145/3581783

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. domain shift
      2. pseudo labels
      3. scene text recognition

      Qualifiers

      • Research-article

      Conference

      MM '23: The 31st ACM International Conference on Multimedia
      October 29 - November 3, 2023
      Ottawa ON, Canada

      Acceptance Rates

      Overall Acceptance Rate 995 of 4,171 submissions, 24%
