Region-Level Layout Generation for Multi-level Pre-trained Model Based Visual Information Extraction

  • Conference paper
Pattern Recognition (ICPR 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15319)


Abstract

Multimodal pre-trained models have made significant advances in visual information extraction by jointly modeling textual, layout, and visual modalities, among which layout information plays a key role in capturing the inherent structure of documents. However, due to the diversity and complexity of document types and typography styles, how to model various document layouts comprehensively and hierarchically has not yet been fully studied. Compared with the single-level layout adopted by most previous works, multi-level layouts, comprising word-level, segment-level, and region-level layouts, provide a more principled modeling of complex document structures. Considering that most existing OCR tools lack high-quality region-level layout outputs, which hinders the use of multi-level layout information, we propose a region-level layout generation method named ReMe based on hierarchical clustering. By iteratively clustering and merging segment-level bounding boxes, ReMe ensures that strongly correlated, semantically related segments share the same region-level bounding box. ReMe can be seamlessly integrated into existing multi-level layout modeling methods at negligible cost. Experimental results show that, after being pre-trained on only 2 million documents from the IIT-CDIP dataset, the model achieves new state-of-the-art results on downstream visual information extraction datasets, and that the region-level layout information generated by ReMe significantly enhances the model’s understanding of structured documents, especially on the Relation Extraction task.
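
As a rough illustration of the idea, the sketch below shows one way region-level boxes could be produced by iteratively merging segment-level bounding boxes until no close pair remains. It is a hypothetical, simplified Python sketch: the function names and the purely spatial merging criterion (a gap threshold) are assumptions made for illustration, whereas the actual ReMe procedure described in the paper clusters hierarchically and also accounts for the semantic correlation between segments.

    # Hypothetical sketch (not the authors' ReMe implementation): greedily merge
    # segment-level boxes whose spatial gap is below a threshold, yielding
    # region-level boxes. The real criteria would also consider semantic correlation.
    from typing import List, Tuple

    Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

    def box_gap(a: Box, b: Box) -> float:
        """Axis-aligned gap between two boxes; 0 if they touch or overlap."""
        dx = max(0.0, max(a[0], b[0]) - min(a[2], b[2]))
        dy = max(0.0, max(a[1], b[1]) - min(a[3], b[3]))
        return max(dx, dy)

    def merge(a: Box, b: Box) -> Box:
        """Union bounding box covering both input boxes."""
        return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

    def cluster_regions(segments: List[Box], max_gap: float = 20.0) -> List[Box]:
        """Repeatedly fuse the closest pair of boxes until every remaining pair
        is farther apart than max_gap; the survivors serve as region-level boxes."""
        boxes = list(segments)
        while True:
            best = None
            for i in range(len(boxes)):
                for j in range(i + 1, len(boxes)):
                    g = box_gap(boxes[i], boxes[j])
                    if g <= max_gap and (best is None or g < best[0]):
                        best = (g, i, j)
            if best is None:
                return boxes
            _, i, j = best
            fused = merge(boxes[i], boxes[j])
            boxes = [b for k, b in enumerate(boxes) if k not in (i, j)] + [fused]

    # Example: three segments on one text line plus a distant footer segment.
    segs = [(10, 10, 60, 25), (65, 10, 120, 25), (125, 12, 180, 25), (10, 400, 200, 420)]
    print(cluster_regions(segs, max_gap=10.0))  # two region-level boxes

Segments that end up sharing a region-level box would then receive that box as an additional, coarser layout input alongside their word- and segment-level boxes.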



Acknowledgements

This work is supported by the National Key Research and Development Program under Grant 2020AAA0109700, the Youth Program of the State Key Laboratory of Multimodal Artificial Intelligence Systems, and the National Natural Science Foundation of China (NSFC) under Grant U23B2029.

Author information


Corresponding author

Correspondence to Lin-Lin Huang.



Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Li, S., Li, XH., Yin, F., Huang, LL. (2025). Region-Level Layout Generation for Multi-level Pre-trained Model Based Visual Information Extraction. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15319. Springer, Cham. https://doi.org/10.1007/978-3-031-78495-8_15

  • DOI: https://doi.org/10.1007/978-3-031-78495-8_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-78494-1

  • Online ISBN: 978-3-031-78495-8

  • eBook Packages: Computer Science, Computer Science (R0)
