
Long-Range Transformer Architectures for Document Understanding

  • Conference paper
Document Analysis and Recognition – ICDAR 2023 Workshops (ICDAR 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14194)


Abstract

Since their release, Transformers have revolutionized many fields, from Natural Language Understanding to Computer Vision. Document Understanding (DU) was not left behind, with the first Transformer-based models for DU dating from late 2019. However, the computational complexity of the self-attention operation limits their capabilities to short sequences. In this paper, we explore multiple strategies for applying Transformer-based models to long multi-page documents. We introduce two new multi-modal (text + layout) long-range models for DU, based on efficient implementations of Transformers for long sequences. Long-range models can effectively process whole documents at once and are less impaired by document length. We compare them to LayoutLM, a classical Transformer adapted to DU and pre-trained on millions of documents. We further propose a 2D relative attention bias that guides self-attention towards relevant tokens without harming model efficiency. We observe improvements on Information Retrieval for multi-page business documents, at a small performance cost on shorter sequences. Relative 2D attention also proved effective on dense text, for both classical and long-range models.
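
For intuition, the short sketch below shows one way an additive 2D relative attention bias can be injected into standard scaled dot-product attention, computed from the spatial offsets between token positions. It is a minimal illustration under assumed choices (Euclidean offsets, a hypothetical decay rate alpha, a single head, no batching), not the exact formulation used in the paper.

    # Minimal sketch (assumptions flagged below), not the paper's exact method:
    # scaled dot-product attention with an additive 2D relative position bias.
    import torch
    import torch.nn.functional as F

    def relative_2d_bias(positions: torch.Tensor, alpha: float = 0.05) -> torch.Tensor:
        """Illustrative additive bias: 0 for nearby tokens, more negative with distance.

        positions: (seq_len, 2) tensor of (x, y) token coordinates (e.g. box centres).
        Returns a (seq_len, seq_len) bias added to the attention scores.
        """
        dx = positions[:, 0].unsqueeze(1) - positions[:, 0].unsqueeze(0)
        dy = positions[:, 1].unsqueeze(1) - positions[:, 1].unsqueeze(0)
        dist = torch.sqrt(dx ** 2 + dy ** 2)           # Euclidean 2D offset (an assumption)
        return -alpha * dist                           # alpha is a hypothetical decay rate

    def attention_with_2d_bias(q, k, v, positions):
        """q, k, v: (seq_len, d_head); positions: (seq_len, 2)."""
        d = q.size(-1)
        scores = q @ k.transpose(-2, -1) / d ** 0.5    # standard dot-product scores
        scores = scores + relative_2d_bias(positions)  # bias injected before the softmax
        return F.softmax(scores, dim=-1) @ v

    # Toy usage: 6 tokens with 16-dimensional heads and random layout coordinates.
    q, k, v = (torch.randn(6, 16) for _ in range(3))
    pos = torch.rand(6, 2) * 1000                      # coordinates in pixel space
    print(attention_with_2d_bias(q, k, v, pos).shape)  # torch.Size([6, 16])

Because the bias is added to the score matrix before the softmax, it reuses the existing attention computation and adds no extra quadratic terms beyond what self-attention already pays, which is the efficiency property the abstract refers to.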


Notes

  1. Model implementations and weights are available at https://github.com/thibaultdouzon/long-range-document-transformer.

  2. A squircle is an intermediate shape between a square and a circle, see https://en.wikipedia.org/wiki/Squircle. The contours of the surface described by \(B^{\textrm{squircle}}\) are not actually squircles but also range from square to circle. A short illustrative sketch follows these notes.
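
As referenced in note 2, the sketch below shows one plausible way to build a bias surface whose level sets interpolate between a circle and a square, using a p-norm of the 2D offsets. The function name, the exponent p, and the decay rate alpha are illustrative assumptions, not the paper's definition of \(B^{\textrm{squircle}}\).

    # Hedged sketch of a "squircle-like" bias surface: a p-norm of the 2D offset,
    # whose level sets are circles at p = 2 and approach squares as p grows.
    # Names and constants here are illustrative assumptions only.
    import torch

    def squircle_like_bias(dx: torch.Tensor, dy: torch.Tensor,
                           p: float = 4.0, alpha: float = 0.01) -> torch.Tensor:
        """Additive attention bias decreasing with the p-norm of the (dx, dy) offset."""
        dist_p = (dx.abs() ** p + dy.abs() ** p) ** (1.0 / p)
        return -alpha * dist_p

    dx = torch.tensor([3.0, 0.0])
    dy = torch.tensor([4.0, 5.0])
    print(squircle_like_bias(dx, dy, p=2.0))   # circular contours (Euclidean distance)
    print(squircle_like_bias(dx, dy, p=8.0))   # contours close to a square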


Author information

Corresponding author

Correspondence to Thibault Douzon.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Douzon, T., Duffner, S., Garcia, C., Espinas, J. (2023). Long-Range Transformer Architectures for Document Understanding. In: Coustaty, M., Fornés, A. (eds) Document Analysis and Recognition – ICDAR 2023 Workshops. ICDAR 2023. Lecture Notes in Computer Science, vol 14194. Springer, Cham. https://doi.org/10.1007/978-3-031-41501-2_4

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-41501-2_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-41500-5

  • Online ISBN: 978-3-031-41501-2

  • eBook Packages: Computer Science (R0)
