DOI: 10.1145/3503161.3547977
Research article

From Token to Word: OCR Token Evolution via Contrastive Learning and Semantic Matching for Text-VQA

Published: 10 October 2022

Abstract

Text-based Visual Question Answering (Text-VQA) is a question-answering task over scene text, where the text is usually recognized by Optical Character Recognition (OCR) systems. However, text from OCR systems often contains spelling errors, such as "pepsi" being recognized as "peosi". These OCR errors are one of the major challenges for Text-VQA systems. To address this, we propose a novel Text-VQA method that alleviates OCR errors via OCR token evolution. First, we artificially create misspelled OCR tokens at training time to make the system more robust to OCR errors. Specifically, we propose an OCR Token-Word Contrastive (TWC) learning task, which pre-trains word representations by augmenting OCR tokens via the Levenshtein distance between OCR tokens and words in a dictionary. Second, assuming that the majority of characters in a misspelled OCR token are still correct, we propose and fine-tune a multimodal transformer that predicts the answer using character-based word embeddings. In particular, we introduce a vocabulary predictor with character-level semantic matching, which enables the model to recover the correct word from the vocabulary even from misspelled OCR tokens. Extensive experimental evaluations show that our method outperforms state-of-the-art methods on both the TextVQA and ST-VQA datasets. The code will be released at https://github.com/xiaojino/TWA.
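The augmentation step described above hinges on the Levenshtein (edit) distance between an OCR token and dictionary words. As an illustration only, not the authors' released code, here is a minimal sketch of edit-distance matching that maps the abstract's misspelled token "peosi" back to "pepsi"; the toy dictionary and the helper names `levenshtein` and `nearest_word` are assumptions for this sketch:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance over insertions,
    # deletions, and substitutions, using a rolling row of size len(b)+1.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def nearest_word(token: str, dictionary: list[str]) -> tuple[str, int]:
    # Return the dictionary word closest to the OCR token by edit distance.
    return min(((w, levenshtein(token, w)) for w in dictionary),
               key=lambda pair: pair[1])

# The misspelled OCR token "peosi" is one substitution away from "pepsi".
print(nearest_word("peosi", ["pepsi", "cola", "person", "pesto"]))  # -> ('pepsi', 1)
```

In the paper's setting this kind of distance would drive which dictionary words serve as positive/negative pairs for the TWC pre-training task; the actual pairing strategy is defined in the paper itself.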

Supplementary Material

MP4 File (MM22-fp0936.mp4)
Presentation video - From Token to Word: OCR Token Evolution via Contrastive Learning and Semantic Matching for Text-VQA


Cited By

  • (2024) Unveiling Digital Secrets: An Image Text Vision App for Enhanced Digital Forensics Investigations. 2024 12th International Symposium on Digital Forensics and Security (ISDFS). DOI: 10.1109/ISDFS60797.2024.10527293, pp. 1-6. Online publication date: 29-Apr-2024.
  • (2024) Segment then Match: Find the Carrier before Reasoning in Scene-Text VQA. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). DOI: 10.1109/ICASSP48485.2024.10445873, pp. 8130-8134. Online publication date: 14-Apr-2024.
  • (2023) Self-Supervised Implicit Glyph Attention for Text Recognition. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). DOI: 10.1109/CVPR52729.2023.01467, pp. 15285-15294. Online publication date: Jun-2023.

    Published In

    MM '22: Proceedings of the 30th ACM International Conference on Multimedia
    October 2022
    7537 pages
    ISBN:9781450392037
    DOI:10.1145/3503161

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. answer predictor
    2. ocr token representation
    3. text-based visual question answering
    4. vision-and-language pre-training model

Conference

MM '22

Acceptance Rates

Overall Acceptance Rate: 995 of 4,171 submissions, 24%


Article Metrics

    • Downloads (last 12 months): 107
    • Downloads (last 6 weeks): 6

Reflects downloads up to 30 Aug 2024
