DOI: 10.1145/3397271.3401442

Robust Layout-aware IE for Visually Rich Documents with Pre-trained Language Models

Published: 25 July 2020

Abstract

Many business documents processed in modern NLP and IR pipelines are visually rich: in addition to text, their semantics can also be captured by visual traits such as layout, format, and fonts. We study the problem of information extraction from visually rich documents (VRDs) and present a model that combines the power of large pre-trained language models and graph neural networks to efficiently encode both textual and visual information in business documents. We further introduce new fine-tuning objectives that improve in-domain unsupervised fine-tuning, allowing the model to better exploit large amounts of unlabeled in-domain data.
We experiment on real-world invoice and resume datasets and show that the proposed method outperforms strong text-based RoBERTa baselines by 6.3% absolute F1 on invoices and 4.7% absolute F1 on resumes. When evaluated in a few-shot setting, our method requires up to 30x less annotated data than the baseline to achieve the same level of performance (~90% F1).
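The core idea described in the abstract, representing each text segment on the page as a graph node that carries both a language-model embedding and its bounding-box geometry, then propagating information between spatially nearby segments, can be illustrated with a minimal NumPy sketch. This is a hypothetical toy example, not the authors' released code: the random vectors stand in for RoBERTa outputs, the distance threshold and single GCN-style layer are illustrative choices, and the segment labels in the comments are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# 4 text segments: 8-dim embeddings (stand-ins for RoBERTa outputs) and
# (x0, y0, x1, y1) bounding boxes normalized to [0, 1].
text_emb = rng.normal(size=(4, 8))
boxes = np.array([
    [0.10, 0.10, 0.30, 0.15],  # e.g. "Invoice No:"
    [0.35, 0.10, 0.50, 0.15],  # e.g. "12345"
    [0.10, 0.50, 0.30, 0.55],  # e.g. "Total:"
    [0.35, 0.50, 0.50, 0.55],  # e.g. "$99.00"
])

# Box centers, used to decide which segments are spatial neighbors.
centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                    (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)
dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
adj = (dist < 0.3).astype(float)  # adjacency with self-loops (dist == 0)

# Node features = text embedding concatenated with layout coordinates.
feats = np.concatenate([text_emb, boxes], axis=1)          # shape (4, 12)

# One simplified GCN-style propagation step: H' = ReLU(D^-1 A H W),
# mixing each segment's textual features with its spatial neighbors'.
deg_inv = 1.0 / adj.sum(axis=1, keepdims=True)
W = rng.normal(size=(12, 6))                                # random projection
hidden = np.maximum(deg_inv * (adj @ feats) @ W, 0.0)       # shape (4, 6)

print(hidden.shape)
```

With the threshold above, "Invoice No:" exchanges information with its value "12345" on the same line but not with the distant "Total:" row, which is the kind of layout signal a purely sequential text model cannot see.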




Published In

SIGIR '20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2020
2548 pages
ISBN:9781450380164
DOI:10.1145/3397271
Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. graph neural networks
  2. structured information extraction
  3. visually rich document

Qualifiers

  • Research-article

Conference

SIGIR '20

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Article Metrics

  • Downloads (last 12 months): 36
  • Downloads (last 6 weeks): 7
Reflects downloads up to 11 Jan 2025.


Cited By

View all
  • (2024) A Robust Framework for One-Shot Key Information Extraction via Deep Partial Graph Matching. IEEE Transactions on Image Processing, 33, 1070-1079. https://doi.org/10.1109/TIP.2024.3357251. Online publication date: 2024.
  • (2024) Extraction of skill sets from unstructured documents. 2024 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT), 1-6. https://doi.org/10.1109/CONECCT62155.2024.10677040. Online publication date: 12-Jul-2024.
  • (2024) One-Shot Transformer-Based Framework for Visually-Rich Document Understanding. Document Analysis and Recognition - ICDAR 2024, 244-261. https://doi.org/10.1007/978-3-031-70533-5_15. Online publication date: 8-Sep-2024.
  • (2024) Light-Weight Multi-modality Feature Fusion Network for Visually-Rich Document Understanding. Document Analysis and Recognition - ICDAR 2024, 191-207. https://doi.org/10.1007/978-3-031-70533-5_12. Online publication date: 8-Sep-2024.
  • (2023) Visual information extraction deep learning method: a critical review. Journal of Image and Graphics, 28(8), 2276-2297. https://doi.org/10.11834/jig.220904. Online publication date: 2023.
  • (2023) microConceptBERT: Concept-Relation Based Document Information Extraction Framework. 2023 7th SLAAI International Conference on Artificial Intelligence (SLAAI-ICAI), 1-6. https://doi.org/10.1109/SLAAI-ICAI59257.2023.10365022. Online publication date: 23-Nov-2023.
  • (2023) Review of Semi-Structured Document Information Extraction Techniques Based on Deep Learning. 2023 2nd International Conference on Machine Learning, Cloud Computing and Intelligent Mining (MLCCIM), 112-119. https://doi.org/10.1109/MLCCIM60412.2023.00022. Online publication date: 25-Jul-2023.
  • (2023) ResuFormer: Semantic Structure Understanding for Resumes via Multi-Modal Pre-training. 2023 IEEE 39th International Conference on Data Engineering (ICDE), 3154-3167. https://doi.org/10.1109/ICDE55515.2023.00242. Online publication date: Apr-2023.
  • (2023) DocTr: Document Transformer for Structured Information Extraction in Documents. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 19527-19537. https://doi.org/10.1109/ICCV51070.2023.01794. Online publication date: 1-Oct-2023.
  • (2023) Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 19460-19470. https://doi.org/10.1109/ICCV51070.2023.01788. Online publication date: 1-Oct-2023.
