
mmLayout: Multi-grained MultiModal Transformer for Document Understanding

Published: 10 October 2022

Abstract

Recent multimodal Transformers have improved Visually Rich Document Understanding (VrDU) by incorporating visual and textual information. However, existing approaches focus mainly on fine-grained elements such as words and document image patches, making it hard for them to learn from coarse-grained elements: natural lexical units such as phrases and salient visual regions such as prominent image regions. In this paper, we attach more importance to coarse-grained elements, which carry high-density information and consistent semantics valuable for document understanding. First, we propose a document graph to model the complex relationships among multi-grained multimodal elements, in which salient visual regions are detected by a cluster-based method. We then propose a multi-grained multimodal Transformer, mmLayout, that uses this graph to incorporate coarse-grained information into existing pre-trained fine-grained multimodal Transformers. In mmLayout, coarse-grained information is aggregated from fine-grained elements and, after further processing, fused back into them for final prediction. Furthermore, we introduce common-sense enhancement to exploit the semantic information of natural lexical units. Experimental results on four tasks, including information extraction and document question answering, show that our method improves the performance of multimodal Transformers based on fine-grained elements and achieves better performance with fewer parameters. Qualitative analyses show that our method captures consistent semantics in coarse-grained elements.
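The abstract states that salient visual regions are detected by a cluster-based method, but this page does not specify the features or parameters used. As a rough illustration only, the sketch below groups hypothetical word-box centres into coarse-grained regions with a minimal pure-Python DBSCAN-style clustering; the coordinates, `eps`, and `min_pts` values are assumptions, not the paper's settings.

```python
# Hedged sketch (assumptions): cluster fine-grained word-box centres into
# coarse-grained regions with a minimal DBSCAN. Points closer than `eps`
# are density-connected; clusters need at least `min_pts` neighbours.

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within eps of point i (includes i itself).
        px, py = points[i]
        return [j for j, (qx, qy) in enumerate(points)
                if (px - qx) ** 2 + (py - qy) ** 2 <= eps * eps]

    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1          # provisionally noise
            continue
        labels[i] = cid             # i is a core point; start a cluster
        seeds = list(nbrs)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cid     # noise reached from a core: border point
            if labels[j] is not None:
                continue
            labels[j] = cid
            jn = neighbors(j)
            if len(jn) >= min_pts:  # expand only through core points
                seeds.extend(jn)
        cid += 1
    return labels

# Hypothetical word-box centres (x, y) from two spatially separate blocks.
centres = [(10, 10), (12, 11), (11, 13), (100, 100), (102, 101), (101, 103)]
labels = dbscan(centres, eps=5, min_pts=2)
# Two coarse-grained regions emerge: labels [0, 0, 0, 1, 1, 1].
```

In the paper's setting, each resulting cluster of word boxes would correspond to one coarse-grained node in the document graph; here the clustering criterion is purely geometric, which is only one plausible choice.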

Supplementary Material

MP4 File (MM22-fp3074.mp4)
Presentation video of mmLayout: Multi-grained MultiModal Transformer for Document Understanding


Cited By

  • (2024) GraphMLLM: A Graph-Based Multi-level Layout Language-Independent Model for Document Understanding. In Document Analysis and Recognition - ICDAR 2024, 227-243. https://doi.org/10.1007/978-3-031-70533-5_14
  • (2023) Enhancing Visually-Rich Document Understanding via Layout Structure Modeling. In Proceedings of the 31st ACM International Conference on Multimedia, 4513-4523. https://doi.org/10.1145/3581783.3612327
  • (2023) Language Independent Neuro-Symbolic Semantic Parsing for Form Understanding. In Document Analysis and Recognition - ICDAR 2023, 130-146. https://doi.org/10.1007/978-3-031-41679-8_8


        Published In

        MM '22: Proceedings of the 30th ACM International Conference on Multimedia
        October 2022
        7537 pages
        ISBN: 9781450392037
        DOI: 10.1145/3503161


        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

        1. document graph
        2. document understanding
        3. layout
        4. multimodal

        Qualifiers

        • Research-article

        Funding Sources

        • NSFC
        • Chinese Knowledge Center for Engineering Sciences and Technology
        • Fundamental Research Funds for the Central Universities
        • National Key R&D Program of China
        • MoE Engineering Research Center of Digital Library

        Conference

        MM '22

        Acceptance Rates

        Overall Acceptance Rate 995 of 4,171 submissions, 24%


