DOI: 10.1145/3539618.3591886

Form-NLU: Dataset for the Form Natural Language Understanding

Published: 18 July 2023
Abstract

    Compared to general document analysis tasks, form document structure understanding and retrieval are challenging. Form documents are typically produced by two types of authors: a form designer, who develops the form structure and keys, and a form user, who fills in form values based on the provided keys. Hence, if a form user gets confused, the filled-in values may not align with the form designer's intention (structure and keys). In this paper, we introduce Form-NLU, the first dataset for form structure understanding and key-value information extraction, designed to interpret the form designer's intent and the alignment of user-written values with it. It consists of 857 form images, 6k form keys and values, and 4k table keys and values. The dataset covers three form types, digital, printed, and handwritten, spanning diverse form appearances and layouts. We propose a robust form key-value information extraction framework based on positional and logical relations. Using Form-NLU, we first examine strong object detection models for form layout understanding, then evaluate the key information extraction task on the dataset, providing fine-grained results for different types of forms and keys. Furthermore, we combine our framework with an off-the-shelf PDF layout extraction tool and demonstrate its feasibility in real-world cases.
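    This page does not reproduce the dataset's annotation schema, so the snippet below is only a minimal sketch of the kind of record the abstract describes: designer-defined keys, user-written values, and explicit links pairing them. All field names (image, form_type, entities, links) and the bounding-box convention are assumptions, not the paper's actual format.

```python
# Hypothetical sketch of a Form-NLU-style annotation record.
# Field names and the [x1, y1, x2, y2] box convention are assumptions.

record = {
    "image": "form_0001.png",
    "form_type": "digital",  # the dataset covers digital, printed, handwritten
    "entities": [
        {"id": 0, "role": "key",   "text": "Company Name", "bbox": [40, 120, 260, 150]},
        {"id": 1, "role": "value", "text": "Acme Pty Ltd",  "bbox": [280, 120, 520, 150]},
    ],
    "links": [(0, 1)],  # designer key -> user-written value
}

def key_value_pairs(rec):
    """Resolve key/value links into (key text, value text) tuples."""
    by_id = {e["id"]: e for e in rec["entities"]}
    return [(by_id[k]["text"], by_id[v]["text"]) for k, v in rec["links"]]

print(key_value_pairs(record))  # [('Company Name', 'Acme Pty Ltd')]
```

    Representing key-value pairings as explicit links, rather than relying on reading order, is what lets a model check whether a user-written value actually aligns with the designer's intended key.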

    Supplemental Material

    MP4 File
    Form understanding presents notable challenges compared with general document understanding because of the uncertainty introduced when multiple parties author a single document. We introduce Form-NLU to tackle these challenges, focusing on form structure understanding and key-value information extraction. The dataset comprises 857 form images, encompassing 6,000 form keys and values and 4,000 table keys and values, and covers digital, printed, and handwritten forms with the diverse appearances and layouts that appear in the real world. To address the task, we propose a robust form key-value information extraction framework leveraging positional and logical relations. The evaluation assesses object detection models for form layout understanding, examines key information extraction performance across various form types and keys, and explores the feasibility of an off-the-shelf PDF layout extraction tool in real-world scenarios.
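    As a rough illustration of the layout-understanding step evaluated in the paper, the sketch below runs an off-the-shelf torchvision Faster R-CNN over a form image to propose candidate regions. The paper's actual detection models, label set, and score threshold are not given on this page, so the detector, the 0.5 threshold, and the dummy input are all stand-ins.

```python
# Stand-in for the form-layout detection step, using a COCO-pretrained
# Faster R-CNN from torchvision; the paper's own detectors and
# form-specific labels (keys, values, tables, etc.) are not reproduced here.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

# Dummy 3-channel tensor standing in for a scanned or rendered form page.
image = torch.rand(3, 800, 600)

with torch.no_grad():
    (pred,) = model([image])  # one dict per input image

# Keep confident detections as candidate key/value regions (assumed threshold).
keep = pred["scores"] > 0.5
print(pred["boxes"][keep].shape, pred["labels"][keep])
```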


    Cited By

    • (2023) Workshop on Document Intelligence Understanding. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. DOI: 10.1145/3583780.3615312, pp. 5273-5276. Online publication date: 21-Oct-2023.
    • (2023) PDF-VQA: A New Dataset for Real-World VQA on PDF Documents. Machine Learning and Knowledge Discovery in Databases: Applied Data Science and Demo Track. DOI: 10.1007/978-3-031-43427-3_35, pp. 585-601. Online publication date: 18-Sep-2023.


      Published In

      SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
      July 2023
      3567 pages
      ISBN: 9781450394086
      DOI: 10.1145/3539618


      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. datasets
      2. form understanding
      3. natural language understanding

      Qualifiers

      • Research-article

      Conference

      SIGIR '23

      Acceptance Rates

      Overall Acceptance Rate 792 of 3,983 submissions, 20%

      Article Metrics

      • Downloads (last 12 months): 187
      • Downloads (last 6 weeks): 41
      Reflects downloads up to 26 Jul 2024

