DOI: 10.1145/3397271.3401442

Robust Layout-aware IE for Visually Rich Documents with Pre-trained Language Models

Published: 25 July 2020

Abstract

Many business documents processed in modern NLP and IR pipelines are visually rich: in addition to text, their semantics can also be captured by visual traits such as layout, format, and fonts. We study the problem of information extraction from visually rich documents (VRDs) and present a model that combines the power of large pre-trained language models and graph neural networks to efficiently encode both textual and visual information in business documents. We further introduce new fine-tuning objectives that improve in-domain unsupervised fine-tuning, allowing the model to better exploit large amounts of unlabeled in-domain data.
We experiment on real-world invoice and resume datasets and show that the proposed method outperforms strong text-based RoBERTa baselines by 6.3% absolute F1 on invoices and 4.7% absolute F1 on resumes. When evaluated in a few-shot setting, our method requires up to 30x less annotated data than the baseline to achieve the same level of performance (~90% F1).
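The core idea described in the abstract, representing each text segment on the page as a graph node that carries both a language-model embedding and its bounding-box geometry, then propagating information between spatially nearby segments, can be illustrated with a minimal NumPy sketch. This is a hypothetical toy example, not the authors' released code: the random vectors stand in for RoBERTa outputs, the distance threshold and single GCN-style layer are illustrative choices, and the segment labels in the comments are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# 4 text segments: 8-dim embeddings (stand-ins for RoBERTa outputs) and
# (x0, y0, x1, y1) bounding boxes normalized to [0, 1].
text_emb = rng.normal(size=(4, 8))
boxes = np.array([
    [0.10, 0.10, 0.30, 0.15],  # e.g. "Invoice No:"
    [0.35, 0.10, 0.50, 0.15],  # e.g. "12345"
    [0.10, 0.50, 0.30, 0.55],  # e.g. "Total:"
    [0.35, 0.50, 0.50, 0.55],  # e.g. "$99.00"
])

# Box centers, used to decide which segments are spatial neighbors.
centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                    (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)
dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
adj = (dist < 0.3).astype(float)  # adjacency with self-loops (dist == 0)

# Node features = text embedding concatenated with layout coordinates.
feats = np.concatenate([text_emb, boxes], axis=1)          # shape (4, 12)

# One simplified GCN-style propagation step: H' = ReLU(D^-1 A H W),
# mixing each segment's textual features with its spatial neighbors'.
deg_inv = 1.0 / adj.sum(axis=1, keepdims=True)
W = rng.normal(size=(12, 6))                                # random projection
hidden = np.maximum(deg_inv * (adj @ feats) @ W, 0.0)       # shape (4, 6)

print(hidden.shape)
```

With the threshold above, "Invoice No:" exchanges information with its value "12345" on the same line but not with the distant "Total:" row, which is the kind of layout signal a purely sequential text model cannot see.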




Published In

SIGIR '20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2020
2548 pages
ISBN:9781450380164
DOI:10.1145/3397271
Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. graph neural networks
  2. structured information extraction
  3. visually rich document

Qualifiers

  • Research-article

Conference

SIGIR '20

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Article Metrics

  • Downloads (last 12 months): 36
  • Downloads (last 6 weeks): 7
Reflects downloads up to 11 Jan 2025.


Cited By

View all
  • (2024) A Robust Framework for One-Shot Key Information Extraction via Deep Partial Graph Matching. IEEE Transactions on Image Processing, 33, 1070-1079. https://doi.org/10.1109/TIP.2024.3357251. Online publication date: 2024.
  • (2024) Extraction of skill sets from unstructured documents. 2024 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT), 1-6. https://doi.org/10.1109/CONECCT62155.2024.10677040. Online publication date: 12-Jul-2024.
  • (2024) One-Shot Transformer-Based Framework for Visually-Rich Document Understanding. Document Analysis and Recognition - ICDAR 2024, 244-261. https://doi.org/10.1007/978-3-031-70533-5_15. Online publication date: 8-Sep-2024.
  • (2024) Light-Weight Multi-modality Feature Fusion Network for Visually-Rich Document Understanding. Document Analysis and Recognition - ICDAR 2024, 191-207. https://doi.org/10.1007/978-3-031-70533-5_12. Online publication date: 8-Sep-2024.
  • (2023) Visual information extraction deep learning method: a critical review. Journal of Image and Graphics, 28(8), 2276-2297. https://doi.org/10.11834/jig.220904. Online publication date: 2023.
  • (2023) microConceptBERT: Concept-Relation Based Document Information Extraction Framework. 2023 7th SLAAI International Conference on Artificial Intelligence (SLAAI-ICAI), 1-6. https://doi.org/10.1109/SLAAI-ICAI59257.2023.10365022. Online publication date: 23-Nov-2023.
  • (2023) Review of Semi-Structured Document Information Extraction Techniques Based on Deep Learning. 2023 2nd International Conference on Machine Learning, Cloud Computing and Intelligent Mining (MLCCIM), 112-119. https://doi.org/10.1109/MLCCIM60412.2023.00022. Online publication date: 25-Jul-2023.
  • (2023) ResuFormer: Semantic Structure Understanding for Resumes via Multi-Modal Pre-training. 2023 IEEE 39th International Conference on Data Engineering (ICDE), 3154-3167. https://doi.org/10.1109/ICDE55515.2023.00242. Online publication date: Apr-2023.
  • (2023) DocTr: Document Transformer for Structured Information Extraction in Documents. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 19527-19537. https://doi.org/10.1109/ICCV51070.2023.01794. Online publication date: 1-Oct-2023.
  • (2023) Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 19460-19470. https://doi.org/10.1109/ICCV51070.2023.01788. Online publication date: 1-Oct-2023.
