short-paper

MMOCR: A Comprehensive Toolbox for Text Detection, Recognition and Understanding

Authors:

Zhanghui Kuang,

Wayne Zhang, and

Dahua Lin

MM '21: Proceedings of the 29th ACM International Conference on Multimedia

October 2021

Pages 3791 - 3794

https://doi.org/10.1145/3474085.3478328

Published: 17 October 2021

Abstract

We present MMOCR, an open-source toolbox that provides a comprehensive pipeline for text detection and recognition, as well as for their downstream tasks such as named entity recognition and key information extraction. MMOCR implements 14 state-of-the-art algorithms, significantly more than any other open-source OCR project we are aware of to date. To facilitate future research and industrial applications of text-recognition-related problems, we also provide a large number of trained models and detailed benchmarks that give insight into the performance of text detection, recognition and understanding. MMOCR is publicly released at https://github.com/open-mmlab/mmocr.
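The abstract describes a chained pipeline: detection localizes text regions, recognition transcribes them, and a downstream "understanding" stage (for example key information extraction) labels the transcriptions. The following is a minimal sketch of that data flow only, assuming toy stand-ins for every stage. It is not MMOCR's API: `OCRPipeline`, `TextInstance`, `toy_kie`, and the dict-based `fake_image` are hypothetical names introduced purely for illustration, and the rule-based labeler replaces what would be a learned model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Axis-aligned box (x0, y0, x1, y1). Many text detectors output polygons
# instead; a box keeps this hypothetical sketch short.
Box = Tuple[int, int, int, int]


@dataclass
class TextInstance:
    box: Box
    text: str = ""
    label: str = "other"  # downstream tag, e.g. a key-information field


@dataclass
class OCRPipeline:
    """Hypothetical three-stage pipeline: detect -> recognize -> understand."""
    detect: Callable[[object], List[Box]]
    recognize: Callable[[object, Box], str]
    understand: Callable[[List[TextInstance]], List[TextInstance]]

    def __call__(self, image: object) -> List[TextInstance]:
        # Stage 1: localize text regions in the image.
        instances = [TextInstance(box=b) for b in self.detect(image)]
        # Stage 2: transcribe each detected region.
        for inst in instances:
            inst.text = self.recognize(image, inst.box)
        # Stage 3: assign semantic labels using the recognized text.
        return self.understand(instances)


def toy_kie(instances: List[TextInstance]) -> List[TextInstance]:
    """Rule-based stand-in for a learned key information extraction model."""
    for inst in instances:
        if inst.text.replace("-", "").isdigit():
            inst.label = "date"
        elif inst.text.startswith("$"):
            inst.label = "total"
    return instances


# Hypothetical "image": a dict mapping each text region to its string,
# so the toy detector and recognizer can simply read it back.
fake_image: Dict[Box, str] = {
    (0, 0, 50, 10): "2021-10-17",
    (0, 20, 40, 30): "$42.00",
    (0, 40, 60, 50): "MMOCR",
}

pipeline = OCRPipeline(
    detect=lambda img: sorted(img),  # box tuples sorted by (x0, y0, ...)
    recognize=lambda img, box: img[box],
    understand=toy_kie,
)
result = pipeline(fake_image)
print([(r.text, r.label) for r in result])
# [('2021-10-17', 'date'), ('$42.00', 'total'), ('MMOCR', 'other')]
```

Keeping each stage behind a plain callable is what allows a detector, recognizer, or downstream model to be swapped independently. MMOCR itself, like other OpenMMLab projects, builds its models from configuration files rather than from lambdas, so the repository documentation is the place to look for its actual interfaces.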

Cited By

Tang Q, Lee Y, Jung H (2024). The Industrial Application of Artificial Intelligence-Based Optical Character Recognition in Modern Manufacturing Innovations. Sustainability 16(5):2161. Online publication date: 5 March 2024. https://doi.org/10.3390/su16052161
Cai Y, Zhou F, Yin R (2024). Exploring Style-Robust Scene Text Detection via Style-Aware Learning. Electronics 13(2):243. Online publication date: 5 January 2024. https://doi.org/10.3390/electronics13020243
Wang J, Yang L, Wang J, Yang H, Bai L, Wang P, Li X, Luo H, Xu H (2024). LCSTR: Scene Text Recognition with Large Convolutional Kernels. International Journal of Pattern Recognition and Artificial Intelligence 38(01). Online publication date: 9 February 2024. https://doi.org/10.1142/S021800142353004X

Index Terms

MMOCR: A Comprehensive Toolbox for Text Detection, Recognition and Understanding
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Object recognition

Information

Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia

October 2021

5796 pages

ISBN:9781450386517

DOI:10.1145/3474085

General Chairs:
Heng Tao Shen (University of Electronic Science & Technology of China, China)
Yueting Zhuang (Zhejiang University, China)
John R. Smith (IBM, USA)

Program Chairs:
Yang Yang (University of Electronic Science and Technology of China, China)
Pablo Cesar (CWI & TU Delft, The Netherlands)
Florian Metze (Facebook, Inc., USA)
Balakrishnan Prabhakaran (University of Texas at Dallas, USA)

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper

Funding Sources

the Shanghai Committee of Science and Technology

Conference

MM '21

Sponsor:

SIGMM

MM '21: ACM Multimedia Conference

October 20 - 24, 2021

Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%

Bibliometrics & Citations

Article Metrics: 26 total citations; 273 total downloads, including 65 in the last 12 months and 5 in the last 6 weeks.
