
DPGS: Cross-cooperation guided dynamic points generation for scene text spotting

Published: 25 October 2024

Abstract

End-to-end text spotting aims to combine scene text detection and recognition into a unified framework. Handling the relationship between the two sub-tasks plays a pivotal role in designing effective spotters. While polygon- or segmentation-based methods eliminate heuristic post-processing, they still face challenges such as background noise and a high computational burden. In this study, we introduce DPGS, a coarse-to-fine learning framework that leverages Dynamic Points Generation for text Spotting. DPGS simultaneously learns character representations for both the detection and recognition tasks. Specifically, for each text instance, we represent the character sequence as ordered points and model them with learnable point queries. This approach progressively selects appropriate key points covering the characters and leverages group attention to associate similar information from different positions, improving detection accuracy. After passing through a single decoder, the point queries encode text semantics and locations, facilitating the decoding of the central line, boundary, script, and confidence of the text through simple prediction heads. Additionally, we introduce an adaptive cooperative criterion that combines more useful feature knowledge, improving training efficiency. Extensive experiments demonstrate the superiority of DPGS on scene text detection and recognition tasks. Compared with the respective top-1 methods, DPGS improves the average recognition accuracy by 3.7%, 1.9%, and 0.7% on the Total-Text, ICDAR15, and CTW1500 datasets, respectively.
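The abstract's core idea, representing each text instance as an ordered sequence of points along its center line, can be made concrete with a small sketch. The NumPy snippet below is an illustrative assumption on our part, not the authors' implementation: it samples n ordered points from a cubic Bezier center line via the degree-3 Bernstein basis, the curve family popularized by earlier Bezier-curve spotters such as ABCNet.

```python
import numpy as np

def bezier_points(ctrl, n):
    """Sample n ordered points along a cubic Bezier center line.

    ctrl: (4, 2) array of control points; each sampled point is a weighted
    sum of the control points under the degree-3 Bernstein basis.
    """
    t = np.linspace(0.0, 1.0, n)
    basis = np.stack([(1 - t) ** 3,           # B0
                      3 * t * (1 - t) ** 2,   # B1
                      3 * t ** 2 * (1 - t),   # B2
                      t ** 3], axis=1)        # (n, 4) Bernstein weights
    return basis @ ctrl                       # (n, 2) ordered points

# A gently arched text instance: endpoints at y=0, interior controls raised.
ctrl = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 1.0], [3.0, 0.0]])
pts = bezier_points(ctrl, 5)
# pts[0] is the start point (0, 0), pts[-1] the end point (3, 0), and the
# middle sample sits on the arch at (1.5, 0.75).
```

In a DETR-style spotter of the kind described above, each sampled position would seed a learnable point query that the decoder refines before simple heads read off the center line, boundary, script, and confidence.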



            Published In

            Knowledge-Based Systems, Volume 302, Issue C, October 2024, 670 pages

            Publisher

            Elsevier Science Publishers B. V.

            Netherlands


            Author Tags

            1. Scene text detection and recognition
            2. Coarse-to-fine
            3. Cross-cooperative learning
            4. K-NN search

            Qualifiers

            • Research-article
