Fine-Tuning Vision Encoder–Decoder Transformers for Handwriting Text Recognition on Historical Documents

Parres, Daniel; Paredes, Roberto

doi:10.1007/978-3-031-41685-9_16

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14190)

Included in the following conference series:

International Conference on Document Analysis and Recognition

Abstract

Handwritten text recognition (HTR) has seen significant advancements in recent years, mainly due to the incorporation of deep learning techniques. One area of HTR that has garnered particular interest is the transcription of historical documents, as there is a vast amount of records available that have yet to be processed, potentially resulting in a loss of information due to deterioration.

Currently, the most widely used HTR approach is to train convolutional recurrent neural networks (CRNN) with connectionist temporal classification loss. Additionally, language models based on n-grams are often utilized in conjunction with CRNNs. While transformer models have revolutionized natural language processing, they have yet to be widely adopted in the context of HTR for historical documents.

In this paper, we propose a new approach for HTR on historical documents that involves fine-tuning pre-trained transformer models, specifically vision encoder–decoder models. This approach presents several challenges, including the limited availability of large amounts of training data for specific HTR tasks. We explore various strategies for initializing and training transformer models and present a model that outperforms existing state-of-the-art methods on three different datasets. Specifically, our proposed model achieves a word error rate of 6.9% on the ICFHR 2014 Bentham dataset, 14.5% on the ICFHR 2016 Ratsprotokolle dataset, and 17.3% on the Saint Gall dataset.
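The results above are reported as word error rate (WER). As a minimal sketch of how this metric is commonly defined, the following computes the word-level edit distance (substitutions, deletions, and insertions) between a reference transcription and a hypothesis, divided by the number of reference words. The function name `word_error_rate`, whitespace tokenization, and per-line scoring are assumptions made for illustration; this is not the authors' evaluation code, and benchmark protocols often aggregate edits over a whole test set and may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is plain whitespace splitting, which is
    an assumption and may differ from a given benchmark's protocol.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr

    return prev[-1] / len(ref_words)


if __name__ == "__main__":
    # One substitution ("quick" -> "quik") and one deletion ("fox")
    # against a four-word reference.
    print(word_error_rate("the quick brown fox", "the quik brown"))
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.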

Acknowledgements

Work partially supported by the Universitat Politècnica de València under the PAID-01-22 programme, by grant PID2020-116813RB-I00 funded by MCIN/AEI/10.13039/501100011033, by the support of valgrAI - Valencian Graduate School and Research Network of Artificial Intelligence and the Generalitat Valenciana, and co-funded by the European Union.

Author information

Authors and Affiliations

PRHLT Research Center, Universitat Politècnica de València, Valencia, Spain
Daniel Parres & Roberto Paredes
Valencian Graduate School and Research Network of Artificial Intelligence, Camí de Vera s/n, 46022, Valencia, Spain
Roberto Paredes

Corresponding author

Correspondence to Daniel Parres.

Editor information

Editors and Affiliations

TU Dortmund University, Dortmund, Germany
Gernot A. Fink
Adobe, College Park, MD, USA
Rajiv Jain
Osaka Metropolitan University, Osaka, Japan
Koichi Kise
Rochester Institute of Technology, Rochester, NY, USA
Richard Zanibbi

About this paper

Cite this paper

Parres, D., Paredes, R. (2023). Fine-Tuning Vision Encoder–Decoder Transformers for Handwriting Text Recognition on Historical Documents. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14190. Springer, Cham. https://doi.org/10.1007/978-3-031-41685-9_16

DOI: https://doi.org/10.1007/978-3-031-41685-9_16
Published: 19 August 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41684-2
Online ISBN: 978-3-031-41685-9
eBook Packages: Computer Science, Computer Science (R0)

Societies and partnerships

The International Association for Pattern Recognition
