research-article

PBFormer: Capturing Complex Scene Text Shape with Polynomial Band Transformer

Authors:

Wei Peng

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

Pages 2112 - 2120

https://doi.org/10.1145/3581783.3612059

Published: 27 October 2023

Abstract

We present PBFormer, an efficient yet powerful scene text detector that unifies the transformer with a novel text shape representation, the Polynomial Band (PB). The representation uses four polynomial curves to fit a text's top, bottom, left, and right sides, so it can capture text with a complex shape by varying the polynomial coefficients. PB has appealing features compared with conventional representations: 1) It can model different curvatures with a fixed number of parameters, whereas polygon-point-based methods need a different number of points for different shapes. 2) It can distinguish adjacent or overlapping texts because their curve coefficients are clearly different, whereas segmentation-based or point-based methods suffer from adhesive spatial positions. PBFormer combines the PB with the transformer and directly generates smooth text contours by sampling the predicted curves, without interpolation. A parameter-free cross-scale pixel attention (CPA) module highlights the feature map of a suitable scale while suppressing the other feature maps. This simple operation helps detect small-scale texts and is compatible with the one-stage DETR framework, which requires no NMS post-processing. Furthermore, PBFormer is trained with a shape-contained loss, which not only enforces piecewise alignment between the ground-truth and predicted curves but also keeps the curves' positions and shapes consistent with each other. Without bells and whistles such as text pre-training, our method outperforms previous state-of-the-art text detectors on arbitrary-shaped text datasets. Code will be made public.
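To make the idea of a four-curve band concrete, the following is a minimal sketch of how a closed contour might be sampled from four polynomials. It is not the authors' implementation: the abstract does not state the parameterization, so the choice of y = f(x) for the top and bottom sides, x = g(y) for the left and right sides, the sampling intervals, and the function name `sample_polynomial_band` are all assumptions made for illustration.

```python
import numpy as np


def sample_polynomial_band(top, bottom, left, right, x_range, y_range, n=16):
    """Sample a closed contour from four polynomial curves (illustrative only).

    top, bottom : coefficients (highest degree first) of y = f(x)  [assumed form]
    left, right : coefficients (highest degree first) of x = g(y)  [assumed form]
    x_range     : (x_min, x_max) interval over which top/bottom are sampled
    y_range     : (y_min, y_max) interval over which left/right are sampled
    n           : number of samples per side
    """
    xs = np.linspace(x_range[0], x_range[1], n)
    ys = np.linspace(y_range[0], y_range[1], n)

    # Walk clockwise in image coordinates (y grows downward).
    top_pts = np.stack([xs, np.polyval(top, xs)], axis=1)                    # left -> right
    right_pts = np.stack([np.polyval(right, ys), ys], axis=1)                # top -> bottom
    bottom_pts = np.stack([xs[::-1], np.polyval(bottom, xs[::-1])], axis=1)  # right -> left
    left_pts = np.stack([np.polyval(left, ys[::-1]), ys[::-1]], axis=1)      # bottom -> top
    return np.concatenate([top_pts, right_pts, bottom_pts, left_pts], axis=0)


# Hypothetical curved text: a parabolic band 20 pixels tall and 100 pixels wide.
top = [0.01, -1.0, 35.0]     # y = 0.01 * (x - 50)^2 + 10
bottom = [0.01, -1.0, 55.0]  # same curve shifted down by 20 pixels
left = [0.0]                 # x = 0 (straight vertical side)
right = [100.0]              # x = 100 (straight vertical side)

contour = sample_polynomial_band(top, bottom, left, right,
                                 x_range=(0.0, 100.0), y_range=(35.0, 55.0))
print(contour.shape)  # (64, 2)
```

In this sketch the number of coefficients stays the same whatever the curvature, and a denser contour only requires a larger `n`, which illustrates the fixed-parameter and interpolation-free properties described in the abstract.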

Supplementary Material

MP4 File (1636-video.mp4)

Presentation Video


Index Terms

1. Applied computing
  1. Document management and text processing
    1. Document capture
      1. Optical character recognition
2. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Object detection


Information & Contributors

Information

Published In


MM '23: Proceedings of the 31st ACM International Conference on Multimedia

October 2023

9913 pages

ISBN:9798400701085

DOI:10.1145/3581783

General Chairs:
Abdulmotaleb El Saddik (University of Ottawa, Canada & MBZUAI, UAE),
Tao Mei (HiDream.ai, China),
Rita Cucchiara (University of Modena and Reggio Emilia, Italy)

Program Chairs:
Marco Bertini (University of Florence, Italy),
Diana Patricia Tobon Vallejo (Universidad de Medellín, Colombia),
Pradeep K. Atrey (University at Albany, State University of New York, USA),
M. Shamim Hossain (King Saud University, KSA)

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

MM '23

Sponsor:

SIGMM

MM '23: The 31st ACM International Conference on Multimedia

October 29 - November 3, 2023

Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%

