Light-Weight Multi-modality Feature Fusion Network for Visually-Rich Document Understanding

  • Conference paper
  • First Online:
Document Analysis and Recognition - ICDAR 2024 (ICDAR 2024)

Abstract

Entity extraction (EE) is an important task in visually-rich document understanding (VrDU) that leverages multi-modal features of text, layout, and image. Recent transformer-based architectures enable an effective fusion of these features and achieve strong performance on the EE task. However, these models are heavy, leading to substantial training cost and slow inference. We therefore propose a light-weight transformer-based model, named LMFFN, with a novel layout-aware multi-modal fusion mechanism built on layout self-attention, which enables efficient entity extraction. Specifically, the proposed framework uses only a simple pre-training objective coupled with an effective batch implementation. In addition, no constraints are imposed on the input sequence length or the reading order. This relaxation gives our model an advantage on camera-captured and skewed documents: we observed a 7% F1-score improvement over previous SOTA models on camera data. Evaluation results on three public datasets (CORD, SROIE, and XFUND) show that our proposed architecture achieves competitive performance compared to recent SOTA models while having 5 to 10 times fewer parameters.
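
The abstract describes the fusion mechanism only at a high level. As a rough illustration of the general idea, the sketch below shows a minimal layout-aware self-attention layer in PyTorch, in which attention scores are biased by pairwise relative bounding-box offsets so that no reading order or fixed sequence length is assumed. All names, dimensions, and the bias parameterization are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of layout-aware self-attention: attention scores are biased
# by pairwise relative box geometry, so no reading order or fixed length is assumed.
# Names and dimensions are illustrative only, not taken from the paper.
import torch
import torch.nn as nn


class LayoutSelfAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        # Maps relative (dcx, dcy, dw, dh) between two boxes to a per-head bias.
        self.rel_bias = nn.Sequential(
            nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, num_heads)
        )
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) fused text/image token features
        # boxes: (B, N, 4) normalized (cx, cy, w, h) per token
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        # Pairwise relative layout features between every pair of boxes.
        rel = boxes.unsqueeze(2) - boxes.unsqueeze(1)        # (B, N, N, 4)
        bias = self.rel_bias(rel).permute(0, 3, 1, 2)        # (B, heads, N, N)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5 + bias
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, -1)
        return self.out(out)
```

Because the bias in this sketch depends only on relative box geometry, permuting the tokens (i.e., changing the reading order) changes nothing but the corresponding permutation of the output, which is one way such a relaxation of reading-order constraints can be realized.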

Acknowledgments

We acknowledge Cinnamon AI for supporting this study.

Author information

Corresponding author

Correspondence to Huynh Vu The.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Yang, J., The, H.V., Tuan, H.L. (2024). Light-Weight Multi-modality Feature Fusion Network for Visually-Rich Document Understanding. In: Barney Smith, E.H., Liwicki, M., Peng, L. (eds) Document Analysis and Recognition - ICDAR 2024. ICDAR 2024. Lecture Notes in Computer Science, vol 14804. Springer, Cham. https://doi.org/10.1007/978-3-031-70533-5_12

  • DOI: https://doi.org/10.1007/978-3-031-70533-5_12

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70532-8

  • Online ISBN: 978-3-031-70533-5

  • eBook Packages: Computer Science, Computer Science (R0)
