
mmLayout: Multi-grained MultiModal Transformer for Document Understanding

Published: 10 October 2022

Abstract

Recent multimodal Transformers have improved Visually Rich Document Understanding (VrDU) by incorporating visual and textual information. However, existing approaches focus mainly on fine-grained elements such as words and document image patches, making it hard for them to learn from coarse-grained elements: natural lexical units such as phrases and salient visual regions such as prominent image regions. In this paper, we attach more importance to coarse-grained elements, which carry high-density information and consistent semantics valuable for document understanding. First, we propose a document graph to model the complex relationships among multi-grained multimodal elements, in which salient visual regions are detected by a cluster-based method. We then propose a multi-grained multimodal Transformer, mmLayout, that uses this graph to incorporate coarse-grained information into existing pre-trained fine-grained multimodal Transformers. In mmLayout, coarse-grained information is aggregated from fine-grained elements and, after further processing, fused back into them for final prediction. Furthermore, we introduce common-sense enhancement to exploit the semantic information of natural lexical units. Experimental results on four tasks, including information extraction and document question answering, show that our method improves the performance of multimodal Transformers based on fine-grained elements and achieves better performance with fewer parameters. Qualitative analyses show that our method captures consistent semantics in coarse-grained elements.
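The abstract states that salient visual regions are detected by a cluster-based method, but this page does not specify the features or parameters used. As a rough illustration only, the sketch below groups hypothetical word-box centres into coarse-grained regions with a minimal pure-Python DBSCAN-style clustering; the coordinates, `eps`, and `min_pts` values are assumptions, not the paper's settings.

```python
# Hedged sketch (assumptions): cluster fine-grained word-box centres into
# coarse-grained regions with a minimal DBSCAN. Points closer than `eps`
# are density-connected; clusters need at least `min_pts` neighbours.

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within eps of point i (includes i itself).
        px, py = points[i]
        return [j for j, (qx, qy) in enumerate(points)
                if (px - qx) ** 2 + (py - qy) ** 2 <= eps * eps]

    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1          # provisionally noise
            continue
        labels[i] = cid             # i is a core point; start a cluster
        seeds = list(nbrs)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cid     # noise reached from a core: border point
            if labels[j] is not None:
                continue
            labels[j] = cid
            jn = neighbors(j)
            if len(jn) >= min_pts:  # expand only through core points
                seeds.extend(jn)
        cid += 1
    return labels

# Hypothetical word-box centres (x, y) from two spatially separate blocks.
centres = [(10, 10), (12, 11), (11, 13), (100, 100), (102, 101), (101, 103)]
labels = dbscan(centres, eps=5, min_pts=2)
# Two coarse-grained regions emerge: labels [0, 0, 0, 1, 1, 1].
```

In the paper's setting, each resulting cluster of word boxes would correspond to one coarse-grained node in the document graph; here the clustering criterion is purely geometric, which is only one plausible choice.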

Supplementary Material

MP4 File (MM22-fp3074.mp4)
Presentation video of mmLayout: Multi-grained MultiModal Transformer for Document Understanding


Cited By

  • (2024) GraphMLLM: A Graph-Based Multi-level Layout Language-Independent Model for Document Understanding. In Document Analysis and Recognition - ICDAR 2024, 227-243. https://doi.org/10.1007/978-3-031-70533-5_14
  • (2023) Enhancing Visually-Rich Document Understanding via Layout Structure Modeling. In Proceedings of the 31st ACM International Conference on Multimedia, 4513-4523. https://doi.org/10.1145/3581783.3612327
  • (2023) Language Independent Neuro-Symbolic Semantic Parsing for Form Understanding. In Document Analysis and Recognition - ICDAR 2023, 130-146. https://doi.org/10.1007/978-3-031-41679-8_8


        Published In

        MM '22: Proceedings of the 30th ACM International Conference on Multimedia
        October 2022
        7537 pages
        ISBN: 9781450392037
        DOI: 10.1145/3503161


        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

        1. document graph
        2. document understanding
        3. layout
        4. multimodal

        Qualifiers

        • Research-article

        Funding Sources

        • NSFC
        • Chinese Knowledge Center for Engineering Sciences and Technology
        • Fundamental Research Funds for the Central Universities
        • National Key R&D Program of China
        • MoE Engineering Research Center of Digital Library

        Conference

        MM '22

        Acceptance Rates

        Overall Acceptance Rate 995 of 4,171 submissions, 24%


