DOI: 10.1145/3503161.3547977
Research article

From Token to Word: OCR Token Evolution via Contrastive Learning and Semantic Matching for Text-VQA

Published: 10 October 2022

Abstract

Text-based Visual Question Answering (Text-VQA) is a question-answering task over scene text, where the text is usually recognized by Optical Character Recognition (OCR) systems. However, text from OCR systems often contains spelling errors, such as "pepsi" being recognized as "peosi". These OCR errors are one of the major challenges for Text-VQA systems. To address this, we propose a novel Text-VQA method that alleviates OCR errors via OCR token evolution. First, we artificially create misspelled OCR tokens at training time to make the system more robust to OCR errors. Specifically, we propose an OCR Token-Word Contrastive (TWC) learning task, which pre-trains word representations by augmenting OCR tokens via the Levenshtein distance between OCR tokens and words in a dictionary. Second, assuming that the majority of characters in a misspelled OCR token are still correct, we propose and fine-tune a multimodal transformer that predicts the answer using character-based word embeddings. In particular, we introduce a vocabulary predictor with character-level semantic matching, which enables the model to recover the correct word from the vocabulary even from misspelled OCR tokens. Extensive experimental evaluations show that our method outperforms state-of-the-art methods on both the TextVQA and ST-VQA datasets. The code will be released at https://github.com/xiaojino/TWA.
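The augmentation step described above hinges on the Levenshtein (edit) distance between an OCR token and dictionary words. As an illustration only, not the authors' released code, here is a minimal sketch of edit-distance matching that maps the abstract's misspelled token "peosi" back to "pepsi"; the toy dictionary and the helper names `levenshtein` and `nearest_word` are assumptions for this sketch:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance over insertions,
    # deletions, and substitutions, using a rolling row of size len(b)+1.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def nearest_word(token: str, dictionary: list[str]) -> tuple[str, int]:
    # Return the dictionary word closest to the OCR token by edit distance.
    return min(((w, levenshtein(token, w)) for w in dictionary),
               key=lambda pair: pair[1])

# The misspelled OCR token "peosi" is one substitution away from "pepsi".
print(nearest_word("peosi", ["pepsi", "cola", "person", "pesto"]))  # -> ('pepsi', 1)
```

In the paper's setting this kind of distance would drive which dictionary words serve as positive/negative pairs for the TWC pre-training task; the actual pairing strategy is defined in the paper itself.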

Supplementary Material

MP4 File (MM22-fp0936.mp4)
Presentation video - From Token to Word: OCR Token Evolution via Contrastive Learning and Semantic Matching for Text-VQA


Cited By

  • (2024) Unveiling Digital Secrets: An Image Text Vision App for Enhanced Digital Forensics Investigations. 2024 12th International Symposium on Digital Forensics and Security (ISDFS). DOI: 10.1109/ISDFS60797.2024.10527293, pp. 1-6. Online publication date: 29-Apr-2024.
  • (2024) Segment then Match: Find the Carrier before Reasoning in Scene-Text VQA. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). DOI: 10.1109/ICASSP48485.2024.10445873, pp. 8130-8134. Online publication date: 14-Apr-2024.
  • (2023) Self-Supervised Implicit Glyph Attention for Text Recognition. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). DOI: 10.1109/CVPR52729.2023.01467, pp. 15285-15294. Online publication date: Jun-2023.

    Published In

    MM '22: Proceedings of the 30th ACM International Conference on Multimedia
    October 2022
    7537 pages
    ISBN:9781450392037
    DOI:10.1145/3503161

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. answer predictor
    2. ocr token representation
    3. text-based visual question answering
    4. vision-and-language pre-training model

Conference

MM '22

Acceptance Rates

Overall Acceptance Rate: 995 of 4,171 submissions, 24%


Article Metrics

    • Downloads (last 12 months): 107
    • Downloads (last 6 weeks): 6

Reflects downloads up to 30 Aug 2024
