Leveraging Transformer-Based OCR Model with Generative Data Augmentation for Engineering Document Recognition
Abstract
1. Introduction
- We employed a deep generative model to learn the outdated fonts and styles in MSC data and utilized the trained model to augment the limited MSC data for fine-tuning the two off-the-shelf OCR models. We demonstrated that the augmented data can effectively improve OCR performance.
2. Related Work
2.1. Text Detection
2.2. Printed Text OCR
2.3. Handwritten Text OCR
2.4. Transformer-Based OCR Systems
2.5. OCR for Engineering Documents
3. Proposed Methods
3.1. Text Detection
3.2. ScrabbleGAN for Data Augmentation
3.3. Text Recognition
3.3.1. CRNN-Based Architecture
3.3.2. Transformer-Based Architecture
3.3.3. Transfer Learning Through Fine-Tuning
3.4. Competing Methods
3.4.1. Pre-Trained Tesseract
3.4.2. Pre-Trained EasyOCR
3.4.3. Pre-Trained KerasOCR
3.4.4. Pre-Trained TrOCR
- Pre-trained TrOCR small (https://huggingface.co/microsoft/trocr-small-printed, accessed on 20 June 2024): This variant comprises 62 M parameters and employs a vision transformer [53] (12 layers, a hidden size of 384, and 6 attention heads) as the encoder. The MiniLM [54] transformer (a lightweight language model released by Microsoft, with 6 layers, a hidden size of 256, and 8 attention heads) serves as the decoder.
- Pre-trained TrOCR large (https://huggingface.co/microsoft/trocr-large-printed, accessed on 20 June 2024): This variant comprises 558 M parameters and employs a vision transformer [48] (24 layers, a hidden size of 1024, and 16 attention heads) as the encoder. A large RoBERTa transformer [49] (12 layers, a hidden size of 1024, and 16 attention heads) serves as the decoder.
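For a side-by-side view, the snippet below simply tabulates the encoder/decoder configurations quoted in the two bullets above; all figures are taken from the text, and nothing beyond them is assumed.

```python
# Encoder/decoder configurations of the two pre-trained TrOCR variants,
# as quoted above (parameter counts and dimensions copied verbatim).
TROCR_VARIANTS = {
    "trocr-small-printed": {
        "params_millions": 62,
        "encoder": {"layers": 12, "hidden_size": 384, "attention_heads": 6},   # vision transformer
        "decoder": {"layers": 6, "hidden_size": 256, "attention_heads": 8},    # MiniLM
    },
    "trocr-large-printed": {
        "params_millions": 558,
        "encoder": {"layers": 24, "hidden_size": 1024, "attention_heads": 16},  # vision transformer
        "decoder": {"layers": 12, "hidden_size": 1024, "attention_heads": 16},  # RoBERTa large
    },
}
```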
3.5. Evaluation Metrics
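The body of this section is not reproduced here. As a reference point, OCR systems are most commonly scored with character error rate (CER) and word error rate (WER), both derived from the Levenshtein edit distance. The following is a minimal sketch under those standard definitions (a hypothetical implementation, not the paper's exact code):

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences via dynamic programming (one rolling row)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: the same ratio over whitespace-separated words."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return levenshtein(ref_words, hyp_words) / max(len(ref_words), 1)
```

For example, `cer("MSC-1042", "MSC-1O42")` counts one substitution over eight characters, i.e. 12.5%.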
4. Experiment Setup
4.1. Dataset
4.2. Data Annotation
4.3. Experiments
5. Experiment Results
5.1. Results by Pre-Trained Models on MSC Data
5.2. Results of Data Augmentation Using MSC Documents
5.3. Results by Fine-Tuned Models on MSC Data
- MSC fine-tuning dataset (3734 word images): This dataset includes the remaining 80% of the annotated set cropped from MSC data.
- Augmented MSC fine-tuning dataset (6734 word images): This dataset includes the 3000 synthetic word images produced by ScrabbleGAN in addition to the MSC fine-tuning dataset.
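Assembling the augmented set described above amounts to concatenating the real and synthetic (image, label) pairs: 3734 real MSC word images plus 3000 ScrabbleGAN outputs yield the 6734-image set. A minimal sketch follows; the file names are hypothetical placeholders for the actual annotated crops.

```python
import random

def build_augmented_set(real_samples, synthetic_samples, seed=42):
    """Concatenate real and synthetic (image_path, label) pairs and shuffle reproducibly."""
    combined = list(real_samples) + list(synthetic_samples)
    random.Random(seed).shuffle(combined)
    return combined

# Placeholder entries standing in for the actual cropped word images.
real = [(f"msc/word_{i}.png", f"label_{i}") for i in range(3734)]
synthetic = [(f"scrabblegan/gen_{i}.png", f"label_{i}") for i in range(3000)]
augmented = build_augmented_set(real, synthetic)
print(len(augmented))  # 6734
```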
5.4. Results by Fine-Tuned Models on AirCorps Data
- AirCorps fine-tuning dataset (803 word images): This dataset includes the remaining 80% of the annotated set cropped from AirCorps library documents.
- Augmented AirCorps fine-tuning dataset (1802 word images): This dataset includes 1000 synthetic word images generated by the ScrabbleGAN model trained with the MSC dataset in addition to the AirCorps fine-tuning dataset.
5.5. Case Study for OCR on MSC Documents
6. Discussion
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Henderson, K.; Salado, A. Value and benefits of model-based systems engineering (MBSE): Evidence from the literature. Syst. Eng. 2021, 24, 51–66. [Google Scholar] [CrossRef]
- Shevchenko, N. An Introduction to Model-Based Systems Engineering (MBSE); Carnegie Mellon University, Software Engineering Institute’s Insights (blog): Pittsburgh, PA, USA, 2020. [Google Scholar]
- Department of Defense. Digital Engineering Strategy; Office of the Deputy Assistant Secretary of Defense for Systems Engineering: Washington, DC, USA, 2018.
- Lin, Y.H.; Ting, Y.H.; Huang, Y.C.; Cheng, K.L.; Jong, W.R. Integration of Deep Learning for Automatic Recognition of 2D Engineering Drawings. Machines 2023, 11, 802. [Google Scholar] [CrossRef]
- Memon, J.; Sami, M.; Khan, R.A.; Uddin, M. Handwritten optical character recognition (OCR): A comprehensive systematic literature review (SLR). IEEE Access 2020, 8, 142642–142668. [Google Scholar] [CrossRef]
- Kumar, P.; Revathy, S. An automated invoice handling method using OCR. In Data Intelligence and Cognitive Informatics: Proceedings of ICDICI 2020; Springer: Berlin/Heidelberg, Germany, 2021; pp. 243–254. [Google Scholar]
- Yindumathi, K.; Chaudhari, S.S.; Aparna, R. Analysis of image classification for text extraction from bills and invoices. In Proceedings of the 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Kharagpur, India, 1–3 July 2020; pp. 1–6. [Google Scholar]
- Shambharkar, Y.; Salagrama, S.; Sharma, K.; Mishra, O.; Parashar, D. An automatic framework for number plate detection using ocr and deep learning approach. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 8–14. [Google Scholar] [CrossRef]
- Vedhaviyassh, D.; Sudhan, R.; Saranya, G.; Safa, M.; Arun, D. Comparative analysis of easyocr and tesseractocr for automatic license plate recognition using deep learning algorithm. In Proceedings of the 2022 6th International Conference on Electronics, Communication and Aerospace Technology, Coimbatore, India, 1–3 December 2022; pp. 966–971. [Google Scholar]
- Shashidhar, R.; Manjunath, A.; Kumar, R.S.; Roopa, M.; Puneeth, S. Vehicle number plate detection and recognition using yolo-v3 and ocr method. In Proceedings of the 2021 IEEE International Conference on Mobile Networks and Wireless Communications (ICMNWC), Tumkur, India, 3–4 December 2021; pp. 1–5. [Google Scholar]
- Yu, D.; Li, X.; Zhang, C.; Liu, T.; Han, J.; Liu, J.; Ding, E. Towards accurate scene text recognition with semantic reasoning networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12113–12122. [Google Scholar]
- Huang, M.; Liu, Y.; Peng, Z.; Liu, C.; Lin, D.; Zhu, S.; Yuan, N.; Ding, K.; Jin, L. Swintextspotter: Scene text spotting via better synergy between text detection and text recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4593–4603. [Google Scholar]
- Ye, J.; Chen, Z.; Liu, J.; Du, B. TextFuseNet: Scene Text Detection with Richer Fused Features. IJCAI 2020, 20, 516–522. [Google Scholar]
- Long, S.; He, X.; Yao, C. Scene text detection and recognition: The deep learning era. Int. J. Comput. Vis. 2021, 129, 161–184. [Google Scholar] [CrossRef]
- He, M.; Liao, M.; Yang, Z.; Zhong, H.; Tang, J.; Cheng, W.; Yao, C.; Wang, Y.; Bai, X. MOST: A multi-oriented scene text detector with localization refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8813–8822. [Google Scholar]
- AVMAC LLC. AVMAC Awarded Military Sealift Command Subcontract. 2012. Available online: https://avmacllc.com/avmac-awarded-military-sealift-command-subcontract/ (accessed on 19 December 2024).
- Khallouli, W.; Pamie-George, R.; Kovacic, S.; Sousa-Poza, A.; Canan, M.; Li, J. Leveraging transfer learning and GAN models for OCR from engineering documents. In Proceedings of the 2022 IEEE World AI IoT Congress (AIIoT), Seattle, WA, USA, 6–9 June 2022; pp. 15–21. [Google Scholar]
- Li, M.; Lv, T.; Chen, J.; Cui, L.; Lu, Y.; Florencio, D.; Zhang, C.; Li, Z.; Wei, F. Trocr: Transformer-based optical character recognition with pre-trained models. AAAI Conf. Artif. Intell. 2023, 37, 13094–13102. [Google Scholar] [CrossRef]
- Kerasocr. Available online: https://keras-ocr.readthedocs.io/en/latest/ (accessed on 19 December 2024).
- Ranjan, A.; Behera, V.N.J.; Reza, M. Ocr using computer vision and machine learning. Mach. Learn. Alg. Ind. Appl. 2021, 2021, 83–105. [Google Scholar]
- Philips, J.; Tabrizi, N. Historical Document Processing: A Survey of Techniques, Tools, and Trends. KDIR 2020, 2020, 341–349. [Google Scholar]
- Lubna; Mufti, N.; Shah, S.A.A. Automatic number plate Recognition: A detailed survey of relevant algorithms. Sensors 2021, 21, 3028. [Google Scholar] [CrossRef] [PubMed]
- Antonio, J.; Putra, A.R.; Abdurrohman, H.; Tsalasa, M.S. A Survey on Scanned Receipts OCR and Information Extraction. In Proceedings of the International Conference on Document Analysis and Recognition, Jerusalem, Israel, 29–30 November 2022; pp. 29–30. [Google Scholar]
- Zhou, X.; Yao, C.; Wen, H.; Wang, Y.; Zhou, S.; He, W.; Liang, J. East: An efficient and accurate scene text detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5551–5560. [Google Scholar]
- Baek, Y.; Lee, B.; Han, D.; Yun, S.; Lee, H. Character region awareness for text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9365–9374. [Google Scholar]
- Shi, B.; Bai, X.; Yao, C. An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2298–2304. [Google Scholar] [CrossRef] [PubMed]
- EasyOCR. Available online: https://github.com/JaidedAI/EasyOCR (accessed on 25 September 2024).
- Liao, M.; Shi, B.; Bai, X.; Wang, X.; Liu, W. TextBoxes: A Fast Text Detector with a Single Deep Neural Network. In Proceedings of the AAAI, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Part I 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
- Tesseract. Available online: https://github.com/tesseract-ocr/tesseract (accessed on 19 December 2024).
- GoogleCloud. Detect Text in Images. Available online: https://cloud.google.com/vision/docs/ocr (accessed on 19 December 2024).
- docTR. docTR: Document Text Recognition. Available online: https://mindee.github.io/doctr/ (accessed on 19 December 2024).
- Breuel, T.M. The OCRopus open source OCR system. In Proceedings of the Document Recognition and Retrieval XV, San Jose, CA, USA, 30–31 January 2008; International Society for Optics and Photonics: Bellingham, WA, USA, 2008; Volume 6815, p. 68150F. [Google Scholar]
- Sanyam. PaddleOCR: Unveiling the Power of Optical Character Recognition. 2022. Available online: https://learnopencv.com/optical-character-recognition-using-paddleocr/ (accessed on 19 December 2024).
- Puigcerver, J. Are multidimensional recurrent layers really necessary for handwritten text recognition? In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; Volume 1, pp. 67–72. [Google Scholar]
- de Sousa Neto, A.F.; Bezerra, B.L.D.; Toselli, A.H.; Lima, E.B. HTR-Flor: A deep learning system for offline handwritten text recognition. In Proceedings of the 2020 33rd SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), Porto de Galinhas, Brazil, 7–10 November 2020; pp. 54–61. [Google Scholar]
- Kass, D.; Vats, E. AttentionHTR: Handwritten text recognition based on attention encoder-decoder networks. In International Workshop on Document Analysis Systems; Springer: Berlin/Heidelberg, Germany, 2022; pp. 507–522. [Google Scholar]
- Ly, N.T.; Nguyen, H.T.; Nakagawa, M. 2D self-attention convolutional recurrent network for offline handwritten text recognition. In International Conference on Document Analysis and Recognition; Springer: Berlin/Heidelberg, Germany, 2021; pp. 191–204. [Google Scholar]
- Fujitake, M. Dtrocr: Decoder-only transformer for optical character recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 8025–8035. [Google Scholar]
- Kim, G.; Hong, T.; Yim, M.; Nam, J.; Park, J.; Yim, J.; Hwang, W.; Yun, S.; Han, D.; Park, S. Ocr-free document understanding transformer. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 498–517. [Google Scholar]
- Villena Toro, J.; Wiberg, A.; Tarkian, M. Optical character recognition on engineering drawings to achieve automation in production quality control. Front. Manuf. Technol. 2023, 3, 1154132. [Google Scholar] [CrossRef]
- Saba, A.; Hantach, R.; Benslimane, M. Text Detection and Recognition from Piping and Instrumentation Diagrams. In Proceedings of the 2023 8th International Conference on Image, Vision and Computing (ICIVC), Dalian, China, 27–29 July 2023; pp. 711–716. [Google Scholar]
- Ren, Y.; Yao, H.; Liu, G.; Bai, Z. A text code recognition and positioning system for engineering drawings of nuclear power equipment. In Proceedings of the 2022 IEEE 6th Information Technology and Mechatronics Engineering Conference (ITOEC), Chongqing, China, 4–6 March 2022; Volume 6, pp. 661–665. [Google Scholar]
- Keras Implementation of Convolutional Recurrent Neural Network. Available online: https://github.com/janzd/CRNN (accessed on 19 December 2024).
- Fogel, S.; Averbuch-Elor, H.; Cohen, S.; Mazor, S.; Litman, R. Scrabblegan: Semi-supervised varying length handwritten text generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4324–4333. [Google Scholar]
- Yadav, A.; Singh, S.; Siddique, M.; Mehta, N.; Kotangale, A. OCR using CRNN: A Deep Learning Approach for Text Recognition. In Proceedings of the 2023 4th International Conference for Emerging Technology (INCET), Belgaum, India, 26–28 May 2023; pp. 1–6. [Google Scholar]
- Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 369–376. [Google Scholar]
- Bao, H.; Dong, L.; Piao, S.; Wei, F. Beit: Bert pre-training of image transformers. arXiv 2021, arXiv:2106.08254. [Google Scholar]
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
- Weiss, K.; Khoshgoftaar, T.M.; Wang, D. A survey of transfer learning. J. Big Data 2016, 3, 1–40. [Google Scholar] [CrossRef]
- Akhil, S. An Overview of Tesseract OCR Engine; A seminar report; Department of Computer Science and Engineering National Institute of Technology: Calicut, India, 2016. [Google Scholar]
- Huang, Z.; Chen, K.; He, J.; Bai, X.; Karatzas, D.; Lu, S.; Jawahar, C. Icdar2019 competition on scanned receipt ocr and information extraction. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia, 20–25 September 2019; pp. 1516–1520. [Google Scholar] [CrossRef]
- Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
- Wang, W.; Wei, F.; Dong, L.; Bao, H.; Yang, N.; Zhou, M. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Adv. Neural Inf. Process. Syst. 2020, 33, 5776–5788. [Google Scholar]
OCR Model | Strategy | Fine-Tuning Time | CER (%) | WER (%) | Edit Distance |
---|---|---|---|---|---|
Tesseract [30] | pre-trained | - | 13.14 | 32.34 | 1.2166 |
EasyOCR [27] | pre-trained | - | 10.43 | 30.09 | 0.6125 |
KerasOCR [19] | pre-trained | - | 6.65 | 21.35 | 0.3724 |
TrOCR (small) [18] | pre-trained | - | 5.54 | 20.89 | 0.3041 |
TrOCR (large) [18] | pre-trained | - | 3.52 | 12.47 | 0.1931 |
KerasOCR w MSC [17] | fine-tuned | 8.42 | 3.17 | 11.21 | 0.1494 |
KerasOCR w Aug [17] | fine-tuned | 23.82 | 2.55 | 7.9 | 0.1419 |
TrOCR (small) w MSC | fine-tuned | 348.52 | 1.71 | 7.3 | 0.0939 |
TrOCR (small) w Aug | fine-tuned | 575.4 | 1.65 | 6.18 | 0.0907 |
TrOCR (large) w MSC | fine-tuned | 645.36 | 1.30 | 4.37 | 0.0715 |
TrOCR (large) w Aug | fine-tuned | 1311.2 | 0.09 | 3.94 | 0.005 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Khallouli, W.; Uddin, M.S.; Sousa-Poza, A.; Li, J.; Kovacic, S. Leveraging Transformer-Based OCR Model with Generative Data Augmentation for Engineering Document Recognition. Electronics 2025, 14, 5. https://doi.org/10.3390/electronics14010005