DOI: 10.1145/3343031.3351093

You Only Recognize Once: Towards Fast Video Text Spotting

Published: 15 October 2019

Abstract

Video text spotting remains an important research topic due to its many real-world applications. Previous approaches usually follow a four-stage pipeline: detecting text in individual frames, recognizing the localized text regions frame by frame, tracking text streams, and generating final results with complicated post-processing, which can suffer from huge computational cost as well as interference from low-quality text. In this paper, we propose a fast and robust video text spotting framework that recognizes each localized text stream only once instead of frame by frame. Specifically, we first obtain text regions in videos with a well-designed spatial-temporal detector. We then develop a novel text recommender that selects the highest-quality text from each text stream and recognizes only the selected instances. The recommender assembles text tracking, quality scoring, and recognition into an end-to-end trainable module, which not only avoids interference from low-quality text but also dramatically speeds up the video text spotting process. In addition, we collect a large-scale video text dataset (LSVTD) to promote the video text spotting community; it contains 100 text videos from 22 different real-life scenarios. Extensive experiments on two public benchmarks show that our method speeds up the recognition process by an average of 71 times compared with the frame-wise manner, while also achieving state-of-the-art performance.
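The recommender idea described in the abstract — track a text stream across frames, score the quality of each instance, and run the recognizer only on the best one — can be sketched as follows. This is a minimal illustration, not the paper's actual model: the names (`TextInstance`, `recognize`, `spot_stream`) and the scalar quality scores are hypothetical stand-ins for the learned tracking, quality-scoring, and recognition components.

```python
# Minimal sketch of "recognize once": score every tracked instance of a text
# stream, then recognize only the highest-quality one. All names and scores
# here are illustrative, not the paper's implementation.
from dataclasses import dataclass

@dataclass
class TextInstance:
    frame_id: int
    quality: float   # stand-in for a learned quality-scoring head's output
    crop: str        # placeholder for the cropped text image

def recognize(instance):
    # Stand-in for a text recognizer; counts its calls to show the speedup.
    recognize.calls += 1
    return f"text@frame{instance.frame_id}"
recognize.calls = 0

def spot_stream(stream):
    """Recognize only the highest-quality instance of a tracked text stream."""
    best = max(stream, key=lambda inst: inst.quality)
    return recognize(best)

# One tracked stream spanning 5 frames; frame-wise spotting would invoke the
# recognizer 5 times, the recommender invokes it once.
stream = [TextInstance(i, q, "crop") for i, q in enumerate([0.2, 0.9, 0.4, 0.7, 0.1])]
result = spot_stream(stream)
print(result, recognize.calls)
```

For a stream spanning N frames, frame-wise spotting calls the recognizer N times, while the recommender calls it once per stream — which is where the reported average 71x recognition speedup comes from.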

Supplementary Material

ZIP File (fp997aux.zip)
The ZIP archive contains a demo video and a supplementary document for the main paper.


Published In

cover image ACM Conferences
MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN: 9781450368896
DOI: 10.1145/3343031

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. detection
  2. quality scoring
  3. tracking
  4. video text spotting

Qualifiers

  • Research-article

Conference

MM '19

Acceptance Rates

MM '19 paper acceptance rate: 252 of 936 submissions (27%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)

Cited By

  • (2024) DSText V2: A comprehensive video text spotting dataset for dense and small text. Pattern Recognition, 149:110177. Online publication date: May 2024. DOI: 10.1016/j.patcog.2023.110177
  • (2024) Video text tracking with transformer-based local search. Neurocomputing, 128420. Online publication date: August 2024. DOI: 10.1016/j.neucom.2024.128420
  • (2024) End-to-End Video Text Spotting with Transformer. International Journal of Computer Vision. Online publication date: 12 July 2024. DOI: 10.1007/s11263-024-02063-1
  • (2023) Problems of Combining Multiple Text Recognition Results. Scientific and Technical Information Processing, 50(5):368-375. Online publication date: 1 December 2023. DOI: 10.3103/S0147688223050027
  • (2023) Towards accurate video text spotting with text-wise semantic reasoning. Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 1858-1866. Online publication date: 19 August 2023. DOI: 10.24963/ijcai.2023/206
  • (2023) VTLayout: A Multi-Modal Approach for Video Text Layout. Proceedings of the 31st ACM International Conference on Multimedia, 2775-2784. Online publication date: 26 October 2023. DOI: 10.1145/3581783.3611870
  • (2023) FlowText: Synthesizing Realistic Scene Text Video with Optical Flow Estimation. 2023 IEEE International Conference on Multimedia and Expo (ICME), 1517-1522. Online publication date: July 2023. DOI: 10.1109/ICME55011.2023.00262
  • (2023) ICDAR 2023 Competition on Born Digital Video Text Question Answering. Document Analysis and Recognition - ICDAR 2023, 508-521. Online publication date: 19 August 2023. DOI: 10.1007/978-3-031-41679-8_30
  • (2023) ICDAR 2023 Competition on Video Text Reading for Dense and Small Text. Document Analysis and Recognition - ICDAR 2023, 405-419. Online publication date: 19 August 2023. DOI: 10.1007/978-3-031-41679-8_23
  • (2022) Document image analysis and recognition: a survey. Computer Optics, 46(4):567-589. Online publication date: August 2022. DOI: 10.18287/2412-6179-CO-1020
