Abstract
Since their introduction, Transformers have revolutionized many fields, from Natural Language Understanding to Computer Vision. Document Understanding (DU) was no exception, with the first Transformer-based models for DU appearing in late 2019. However, the computational complexity of the self-attention operation limits their applicability to short sequences. In this paper we explore multiple strategies for applying Transformer-based models to long multi-page documents. We introduce two new multi-modal (text + layout) long-range models for DU, based on efficient implementations of Transformers for long sequences. Long-range models can effectively process whole documents at once and are less impaired by document length. We compare them to LayoutLM, a classical Transformer adapted for DU and pre-trained on millions of documents. We further propose a 2D relative attention bias that guides self-attention towards relevant tokens without harming model efficiency. We observe improvements on Information Retrieval over multi-page business documents, at a small performance cost on shorter sequences. Relative 2D attention also proved effective on dense text, for both normal and long-range models.
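The 2D relative attention bias mentioned above can be pictured as a learned, per-head term added to the attention logits, indexed by the relative horizontal and vertical positions of token bounding boxes. The following is a minimal sketch under that reading, not the authors' implementation: the `Relative2DBias` module, the `bucket` helper, and the logarithmic distance bucketing are illustrative assumptions.

```python
import torch
import torch.nn as nn


def bucket(rel: torch.Tensor, num_buckets: int = 32, max_dist: float = 1000.0) -> torch.Tensor:
    """Map signed relative distances to discrete bucket indices in [0, 2 * num_buckets)."""
    sign = (rel < 0).long() * num_buckets                    # separate buckets for negative offsets
    mag = rel.abs().float().clamp(max=max_dist)
    # Logarithmic bucketing (an assumption): nearby tokens get finer resolution than distant ones.
    scale = torch.log1p(torch.tensor(max_dist))
    idx = (torch.log1p(mag) / scale * (num_buckets - 1)).long()
    return sign + idx


class Relative2DBias(nn.Module):
    """Learned per-head attention bias indexed by bucketed 2D relative positions."""

    def __init__(self, num_heads: int, num_buckets: int = 32):
        super().__init__()
        self.x_bias = nn.Embedding(2 * num_buckets, num_heads)
        self.y_bias = nn.Embedding(2 * num_buckets, num_heads)
        self.num_buckets = num_buckets

    def forward(self, x_centers: torch.Tensor, y_centers: torch.Tensor) -> torch.Tensor:
        # x_centers, y_centers: (seq_len,) horizontal / vertical centers of token bounding boxes.
        dx = bucket(x_centers[None, :] - x_centers[:, None], self.num_buckets)
        dy = bucket(y_centers[None, :] - y_centers[:, None], self.num_buckets)
        # (seq_len, seq_len, num_heads) -> (num_heads, seq_len, seq_len),
        # to be added to the scaled Q K^T attention logits before the softmax.
        return (self.x_bias(dx) + self.y_bias(dy)).permute(2, 0, 1)
```

In such a setup, the returned tensor would be broadcast over the batch and added to the scaled \(QK^\top\) scores before the softmax; since it only depends on pairwise box distances and a small embedding table, it leaves the attention mechanism itself unchanged.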
Notes
1. Model implementation and weights are available at https://github.com/thibaultdouzon/long-range-document-transformer.
2. A squircle is an intermediate shape between a square and a circle; see https://en.wikipedia.org/wiki/Squircle. The contours of the surface described by \(B^{\textrm{squircle}}\) are not actual squircles, but they likewise range from square to circle (a generic example of such a square-to-circle family is sketched after these notes).
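For illustration only (this is a standard square-to-circle family, not necessarily the exact form of \(B^{\textrm{squircle}}\)), superellipse level sets interpolate between the two shapes:

\[
\left|\frac{x}{r}\right|^{p} + \left|\frac{y}{r}\right|^{p} = 1, \qquad 2 \le p < \infty,
\]

which traces a circle for \(p = 2\), the classical squircle for \(p = 4\), and approaches a square as \(p \to \infty\).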
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Douzon, T., Duffner, S., Garcia, C., Espinas, J. (2023). Long-Range Transformer Architectures for Document Understanding. In: Coustaty, M., Fornés, A. (eds) Document Analysis and Recognition – ICDAR 2023 Workshops. ICDAR 2023. Lecture Notes in Computer Science, vol 14194. Springer, Cham. https://doi.org/10.1007/978-3-031-41501-2_4
DOI: https://doi.org/10.1007/978-3-031-41501-2_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41500-5
Online ISBN: 978-3-031-41501-2
eBook Packages: Computer Science (R0)