research-article

Decoupling Recognition from Detection: Single Shot Self-Reliant Scene Text Spotter

Authors:

Chengquan Zhang,

Wenjie Pei

MM '22: Proceedings of the 30th ACM International Conference on Multimedia

Pages 1319 - 1328

https://doi.org/10.1145/3503161.3548266

Published: 10 October 2022

Abstract

Typical text spotters follow a two-stage strategy: first detect the precise boundary of a text instance, then recognize the text within the located region. While this strategy has achieved substantial progress, it has two underlying limitations. 1) Recognition performance depends heavily on detection precision, so errors can propagate from detection to recognition. 2) The RoI cropping that bridges detection and recognition introduces background noise and loses information when pooling or interpolating from feature maps. In this work we propose the single shot Self-Reliant Scene Text Spotter (SRSTS), which circumvents these limitations by decoupling recognition from detection. Specifically, we conduct text detection and recognition in parallel and bridge them through a shared positive anchor point. Consequently, our method can recognize text instances correctly even when their precise boundaries are challenging to detect. Additionally, our method substantially reduces the annotation cost for text detection. Extensive experiments on regular-shaped and arbitrary-shaped benchmarks demonstrate that SRSTS compares favorably to previous state-of-the-art spotters in both accuracy and efficiency.
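The idea of bridging parallel detection and recognition through a shared positive anchor point can be illustrated with a toy sketch. This is not the authors' architecture: the names `positive_anchors`, `spot`, `score_map`, `det_map`, and `rec_map` are hypothetical, and the hand-written grids merely stand in for the outputs a real network would produce. The sketch only shows the data flow implied by the abstract, in which the recognition result is looked up by anchor position and never reads the detected boundary.

```python
from typing import Any, Dict, List, Sequence, Tuple

Anchor = Tuple[int, int]


def positive_anchors(score_map: Sequence[Sequence[float]],
                     threshold: float = 0.5) -> List[Anchor]:
    """Return (row, col) positions whose anchor score exceeds the threshold."""
    return [
        (r, c)
        for r, row in enumerate(score_map)
        for c, score in enumerate(row)
        if score > threshold
    ]


def spot(score_map, det_map, rec_map, threshold: float = 0.5) -> List[Dict[str, Any]]:
    """Pair the two branch outputs at every positive anchor.

    det_map and rec_map are hypothetical per-position outputs of two
    parallel branches. The text is read from rec_map by anchor position
    only, so it does not depend on the boundary stored in det_map.
    """
    results = []
    for r, c in positive_anchors(score_map, threshold):
        results.append({
            "anchor": (r, c),
            "boundary": det_map[r][c],  # detection branch output (assumed format)
            "text": rec_map[r][c],      # recognition branch output (assumed format)
        })
    return results


# Toy 2x2 grids: one positive anchor at row 0, column 1.
score_map = [[0.1, 0.9], [0.2, 0.3]]
det_map = [[None, (10, 4, 58, 20)], [None, None]]
rec_map = [["", "EXIT"], ["", ""]]
clean = spot(score_map, det_map, rec_map)

# A badly localized boundary at the same anchor leaves the text unchanged.
noisy_det_map = [[None, (0, 0, 99, 99)], [None, None]]
noisy = spot(score_map, noisy_det_map, rec_map)
print(clean[0]["text"], noisy[0]["text"])
```

In a two-stage pipeline, by contrast, the recognizer would take a crop defined by the boundary as its input, so the perturbed boundary in the second call would change what the recognizer sees.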

Supplementary Material

MP4 File (MM22-fp2205.mp4)

This is the presentation video of ACM-MM 2022 Poster paper "Decoupling Recognition from Detection: Single Shot Self-Reliant Scene Text Spotter". In this work, we propose to decouple recognition from detection and thereby reduce the interdependence between them.

Index Terms

1. Applied computing
  1. Document management and text processing
    1. Document capture
      1. Optical character recognition

Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia

October 2022

7537 pages

ISBN:9781450392037

DOI:10.1145/3503161

General Chairs: João Magalhães (NOVA University of Lisbon, Portugal); Alberto del Bimbo (University of Florence, Italy); Shin'ichi Satoh (National Institute of Informatics, Japan); Nicu Sebe (University of Trento, Italy)

Program Chairs: Xavier Alameda-Pineda (Inria, Grenoble, France); Qin Jin (Renmin University of China, China); Vincent Oria (New Jersey Institute of Technology, USA); Laura Toni (University College London, UK)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

MM '22

Sponsor:

SIGMM

MM '22: The 30th ACM International Conference on Multimedia

October 10 - 14, 2022

Lisboa, Portugal

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%
