Article

Leveraging Transformer-Based OCR Model with Generative Data Augmentation for Engineering Document Recognition †

1 Department of Engineering Management & Systems Engineering, Old Dominion University, Norfolk, VA 23529, USA
2 Department of Electrical & Computer Engineering, Old Dominion University, Norfolk, VA 23529, USA
* Authors to whom correspondence should be addressed.
† This paper is a revised and expanded version of our paper at the IEEE World AI IoT Congress (AIIoT) 2022 with the title “Leveraging transfer learning and GAN models for OCR from engineering documents”.
Electronics 2025, 14(1), 5; https://doi.org/10.3390/electronics14010005
Submission received: 27 September 2024 / Revised: 10 December 2024 / Accepted: 18 December 2024 / Published: 24 December 2024

Abstract

The long-standing practice of document-based engineering has resulted in the accumulation of a large number of engineering documents across various industries. Engineering documents, such as 2D drawings, continue to play a significant role in exchanging information and sharing knowledge across multiple engineering processes. However, these documents are often stored in non-digitized formats, such as paper and portable document format (PDF) files, making automation difficult. As digital engineering transforms processes in many industries, digitizing engineering documents presents a crucial challenge that requires advanced methods. This research addresses the problem of automatically extracting textual content from non-digitized legacy engineering documents. We introduced an optical character recognition (OCR) system for text detection and recognition that leverages transformer-based generative deep learning models and transfer learning approaches to enhance text recognition accuracy in engineering documents. The proposed system was evaluated on a dataset collected from ships’ engineering drawings provided by a U.S. agency. Experimental results demonstrated that the proposed transformer-based OCR model significantly outperformed pretrained off-the-shelf OCR models.

1. Introduction

Systems engineering is shifting away from a traditional document-based approach to a model-based approach. The adoption of model-based systems engineering (MBSE) is increasing among industry and government practitioners as systems continue to grow in complexity [1]. MBSE is a formalized methodology that supports the requirements, design, analysis, verification, and validation associated with the development of complex systems [2]. In contrast to the traditional document-based approach, MBSE uses system models as the primary artifacts of the systems engineering process. Digital engineering (DE) has empowered the paradigm shift to MBSE by providing the digital infrastructure to design, develop, operate, and sustain complex systems. By leveraging models to connect data across all lifecycle stages using computer-based methods, processes, and tools (MPTs) and incorporating technological innovations such as artificial intelligence and advanced analytics, DE addresses the challenges associated with the complexity, uncertainty, and rapidly changing requirements of modern systems [3]. Digitalization is a key defining feature of digital engineering. In a digital engineering ecosystem, data availability in a machine-readable (digital) format is necessary to gain the advantage of digital technologies, enable data and model sharing, and enhance information traceability.
The long-standing tradition of using document-based engineering has led to the accumulation of a substantial number of legacy engineering documents across several industries. These documents are often available in scanned or paper-based formats and contain valuable knowledge about legacy systems, including technical details, drawings, and configuration data, among other information. Furthermore, engineering documents play a significant role in many engineering processes, such as quality control [4]. In the MBSE environment, it is still common practice to produce various engineering documents that capture information, such as scanned PDFs, spreadsheets, and informal drawings, throughout the system’s lifecycle, making it a document-rich environment. Since manually digitizing or inspecting engineering documents is time-consuming and resource-intensive, there is a growing need to develop automated tools for extracting information from these documents.
Optical character recognition (OCR) is a key digitization technology. OCR refers to converting printed and handwritten text into a digital format that is easy to extract, edit, search, and analyze [5]. Researchers have proposed robust solutions across many application domains and successfully automated many business tasks, such as automatic invoice reading (e.g., [6,7]), license plate recognition (e.g., [8,9,10]), and scene text detection and recognition (e.g., [11,12,13,14,15]). However, extracting text from legacy engineering documents is challenging. In many industries, such as shipbuilding, it is common for information to be stored in noisy, low-quality images or outdated fonts in scanned PDFs. Extracting information with existing OCR systems often results in remarkably low accuracy, causing significant information loss.
This research addresses the problem of extracting text from engineering documents through a case study from the Military Sealift Command (MSC). The MSC has extensive experience in operating and maintaining ships. Along the way, it has accumulated a large number of engineering and maintenance documents, including technical updates about its vessels, stored in various formats such as manuals, bulletins, drawings, and other engineering document data. Soft and hard copies of these documents are stored in various electronic formats and maintained in a central repository called the Virtual Technical Library (VTL) [16]. As part of MSC’s DoD digital transformation efforts, the VTL is undergoing a digitalization process. One of the challenges faced by MSC was the difficulty of extracting textual content from these documents. MSC documents contain different types of text (printed and handwritten text), font styles, and sizes (Figure 1, Figure 2 and Figure 3). Some fonts used in these documents are outdated and irregular, making the text hard to recognize. Furthermore, the background noise present in some MSC engineering documents complicates the OCR process. This paper builds on our previous research [17], where we explored various deep learning and OCR models to extract textual content from MSC engineering documents. We proposed an OCR framework comprising a fine-tuned OCR model for text recognition and a generative deep learning model for data augmentation, and partial preliminary results were presented as a conference paper [17]. Our main contributions are as follows:
  • We annotated a dataset from MSC engineering documents and fine-tuned two state-of-the-art OCR systems, a transformer-based OCR model (TrOCR [18]) and a CRNN-based OCR model (KerasOCR [19]), for improved OCR performance on the MSC documents.
  • We employed a deep generative model to learn the outdated fonts and styles in MSC data and utilized the trained model to augment the limited MSC data for fine-tuning the two off-the-shelf OCR models. We demonstrated that the augmented data can effectively improve OCR performance.
The rest of this paper is organized as follows. Section 2 discusses related work. Section 3 describes our proposed approach. Section 4 presents the data and experimental setup. Section 5 provides the experimental results. Section 6 discusses the main findings, limitations, and implications of this research. Finally, Section 7 concludes the paper and outlines future work.

2. Related Work

OCR is a well-established area of research in machine learning [20]. An OCR system converts scanned documents or input text images into machine-editable and searchable formats. OCR has been successfully applied across various domains, including historical document processing [21], scene text detection and recognition [14], license plate recognition [22], and invoice and receipt processing [23], among others. The advent of deep learning has paved the way for the development of numerous models for text detection and recognition, such as EAST [24] and CRAFT [25] for text detection, and CRNN-CTC [26] and sequence-to-sequence models [18] for text recognition. The main components of an OCR system typically include preprocessing, text detection, text recognition, and postprocessing. OCR systems can be categorized into two types: (1) printed text recognition and (2) handwritten text recognition. Printed text recognition involves converting typed characters into digital text, while handwritten text recognition deals with converting handwritten text into a digital format.

2.1. Text Detection

Text detection involves localizing text regions in documents or images and is crucial for recognizing text in complex backgrounds. It can be performed at the line, word, or character level, with models typically producing bounding boxes around words or sentences. Earlier methods relied on handcrafted features, such as connected components analysis (CCA) [14], sliding windows, and segmentation-based approaches [25]. Recently, deep-learning-based algorithms have excelled in this task. CRAFT [25], widely used in systems like KerasOCR [19] and EasyOCR [27], detects individual character regions using a CNN (based on VGG-16) to generate region and affinity maps. These maps identify characters and the distances between them to form words. Similarly, EAST [24] employs CNNs for pixel-wise text instance prediction, while TextBoxes [28] uses a single-shot multi-box detector (SSD) framework [29] for faster, real-time text detection.
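As an illustration, the sketch below obtains word-level boxes from a scanned drawing using the CRAFT detector bundled with the keras-ocr package. It is a minimal example under the assumption that keras-ocr is installed; the file name is a placeholder, not a file from the dataset.

```python
# Minimal sketch: word-level text detection with the CRAFT detector in keras-ocr.
import keras_ocr

detector = keras_ocr.detection.Detector()        # CRAFT-based detector
image = keras_ocr.tools.read("drawing.png")      # placeholder engineering drawing
boxes = detector.detect(images=[image])[0]       # one array of quadrilateral boxes

# Each box is a 4x2 array of (x, y) corner coordinates around a detected word.
for box in boxes:
    print(box.astype(int).tolist())
```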

2.2. Printed Text OCR

Many OCR systems focus on detecting and recognizing printed text, achieving high accuracy in real-world applications. Examples include Google Tesseract [30], Cloud Vision OCR [31], EasyOCR [27], KerasOCR [19], and docTR API [32]. Most of these models rely on deep learning, with the CRNN architecture coupled with CTC being the preferred choice for text recognition. EasyOCR and KerasOCR are Python-based pipelines designed for scene text detection and recognition, utilizing CRAFT [25] for text detection and CRNN [26] for recognition. Tesseract [30] is a modular framework where text recognition involves preprocessing, layout analysis, segmentation, feature extraction, and recognition, using a bidirectional LSTM in its latest version. OCRopy [33] recognizes text at the line level, improving accuracy with a statistical language model and employing a single-layer bidirectional LSTM with CTC loss. PaddleOCR [34] provides pretrained models for multiple languages, following a similar architecture to EasyOCR and KerasOCR. It includes modules for text detection, orientation classification, and character recognition.

2.3. Handwritten Text OCR

Handwritten text recognition has received significant attention in the literature, with researchers primarily using neural network architectures like CRNN. This task is more challenging than printed text recognition due to the variety of writing styles. Puigcerver [35] proposed a CRNN-based model with a single convolutional layer followed by one-dimensional recurrent layers, outperforming many traditional models. de Sousa Neto et al. [36] introduced HTR-Flor, a gated-CNN model that, with fewer parameters, achieved 33% better performance on benchmark datasets. Kass et al. [37] proposed an attention-based sequence-to-sequence model using a CRNN with ResNet for feature extraction and a bidirectional LSTM for sequence modeling, incorporating transfer learning from scene text recognition to address data limitations. Ly et al. [38] introduced a self-attention convolutional recurrent network (2D-SACRN) combining a self-attention-based feature extractor with recurrent layers and a CTC decoder, achieving competitive results with reduced training time.

2.4. Transformer-Based OCR Systems

Most OCR approaches for both handwritten and printed text recognition are based on the CRNN architecture. However, with the recent success of transformer-based models, there is a growing shift toward transformers for OCR. Microsoft’s TrOCR [18] is a transformer-based model with two components: a visual transformer for feature extraction and a textual transformer for text recognition. TrOCR was pre-trained on large datasets of both printed and handwritten text. Transformer-based OCR models typically follow an encoder–decoder architecture. Fujitake [39] introduced DTrOCR, a model that relies solely on a text decoder. It demonstrated superior performance over previous models, including TrOCR, in various OCR tasks. Kim et al. [40] proposed Donut, an OCR-free transformer model designed for document understanding. Donut directly maps raw document inputs to text outputs without the traditional OCR pipeline, showing competitive results across multiple datasets.

2.5. OCR for Engineering Documents

OCR has been widely applied in areas like scene text recognition and historical document digitization. However, its use for digitizing legacy engineering documentation has received less attention, especially in text recognition from engineering drawings, which remain crucial in the manufacturing industry [4]. Several studies have tackled this issue. Toro et al. [41] developed eDOCr, an OCR framework for extracting information from mechanical engineering drawings to automate quality control. Their approach uses image segmentation, CRAFT for text detection, and KerasOCR for recognition, which was trained on an extended alphabet including special characters and symbols. Saba et al. [42] proposed a two-stage pipeline for piping and instrumentation diagrams (P&IDs), employing EAST for text detection and EasyOCR for recognition, showing strong performance on P&IDs. Lin et al. [4] introduced a system using YOLO for feature extraction and Tesseract for recognizing text, symbols, and numbers, achieving a 70% character recognition rate. Ren et al. [43] addressed text recognition in nuclear power equipment drawings using EAST and PaddleOCR, with promising results. In our previous work, we fine-tuned a CRNN-based KerasOCR model [44] on a dataset of 2D ship drawings and maintenance documents from MSC, significantly improving recognition accuracy over Tesseract and EasyOCR.

3. Proposed Methods

The proposed OCR framework consists of four modules: (1) preprocessing, (2) text detection, (3) data augmentation, and (4) text recognition, as shown in Figure 4. It processes an engineering document and outputs the recognized text in a digital format (e.g., text or JSON). The preprocessing step enhances document quality before applying detection and recognition algorithms. Traditional techniques like binarization, noise removal, and image scaling have been widely used, while deep learning models are increasingly employed for tasks like noise removal. In this research, we used basic preprocessing techniques, leaving more advanced methods for future work. The text detection module identifies textual regions and outputs bounding boxes around words or sentences. The cropped text is then passed to the recognition module, which produces the final output. We experimented with various deep learning models at each stage, training the components separately.
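The snippet below illustrates the kind of basic preprocessing mentioned above (grayscale conversion, denoising, Otsu binarization, and scaling) using OpenCV. It is only a sketch: the specific functions and parameter values are assumptions for illustration, not the exact configuration used in this study.

```python
# Illustrative preprocessing: grayscale, denoising, Otsu binarization, upscaling.
import cv2

def preprocess(path, scale=2.0):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    gray = cv2.fastNlMeansDenoising(gray, h=10)                      # reduce background noise
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)   # binarization
    return cv2.resize(binary, None, fx=scale, fy=scale,
                      interpolation=cv2.INTER_CUBIC)                 # upscale small text

clean = preprocess("scanned_page.png")    # placeholder file name
cv2.imwrite("scanned_page_clean.png", clean)
```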

3.1. Text Detection

This stage focuses on localizing text within the input engineering documents. The ships’ engineering documents used in this research contain various structures (e.g., tables, drawings, etc.). Text detection identifies the areas where text is present and outputs a set of word-level or sentence-level bounding boxes. We qualitatively evaluated two well-known text detection models: EAST [24] and CRAFT [25]. Based on this evaluation, we selected CRAFT as the text detector, as it provided superior text localization performance for our task compared to the EAST model.

3.2. ScrabbleGAN for Data Augmentation

OCR systems typically require a substantial amount of labeled data for training to achieve competitive performance. The text in our documents appears in various formats, fonts, and styles. Off-the-shelf OCR systems, trained on public datasets, struggled to generalize to the unseen fonts and styles in our data. To address this challenge, we utilize the ScrabbleGAN model [45], as shown in Figure 5, to augment our training data by mimicking the formats, fonts, and styles in our ground truth data. The key idea is to train ScrabbleGAN to transfer fonts and styles learned from our ground truth to synthetic word images. To achieve this goal, we first synthesize a set of words with standard formats, fonts, and styles, and then train ScrabbleGAN on our ground truth MSC documents to learn the specific characteristics. Finally, we use the trained model to transfer these learned styles to the synthetic data, augmenting our training dataset.
ScrabbleGAN consists of three main components, as shown in Figure 5: (1) a generator G, (2) a discriminator D, and (3) an OCR model R for text recognition, employing an adversarial training process. The generator G creates synthetic text images from a noise vector, while the discriminator D distinguishes between real text images (ground truth data) and those generated by the generator (mimicked styles). Unlike previous generative models that generate entire words or sentence-level images, ScrabbleGAN takes a different approach by training a generator for each character. The trained character-level generators are stored in a bank of learned filters, each corresponding to a specific element in the vocabulary. To generate a word image, the filters for each character in the word are selected and concatenated to produce the final word image. The OCR model R ensures the quality of the synthetic images by verifying the readability of the text. ScrabbleGAN uses the following loss function:
$l = l_D + \lambda \cdot l_R$ (1)
where $l_D$ and $l_R$ are the loss terms of the discriminator module D and the OCR module R, respectively.
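As a rough PyTorch-style sketch of this objective, the combined loss in Equation (1) could be computed as below. This is a simplification: the actual ScrabbleGAN implementation additionally balances the gradients of the two terms, and the binary adversarial loss and tensor shapes here are assumptions for illustration.

```python
# Simplified sketch of l = l_D + lambda * l_R for the generator update.
import torch
import torch.nn.functional as F

def generator_loss(d_fake_logits, r_logits, target_ids, lam=1.0):
    # Adversarial term l_D: the generator wants D to score fake images as real.
    l_d = F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))
    # Recognition term l_R: the OCR head R must be able to read the fake image (CTC loss).
    log_probs = r_logits.log_softmax(2)   # r_logits: (timesteps, batch, vocab)
    input_lens = torch.full((r_logits.size(1),), r_logits.size(0), dtype=torch.long)
    target_lens = torch.full((target_ids.size(0),), target_ids.size(1), dtype=torch.long)
    l_r = F.ctc_loss(log_probs, target_ids, input_lens, target_lens)
    return l_d + lam * l_r
```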

3.3. Text Recognition

We consider two different architectures for text recognition: convolutional recurrent neural network (CRNN) and transformer. We fine-tune both models on the augmented data and compare their performance on the MSC dataset.

3.3.1. CRNN-Based Architecture

CRNN models [26,46] consist of three components, as illustrated in Figure 6: (1) convolutional layers for visual feature extraction from input images, (2) recurrent layers for sequence modeling and character classification, and (3) a transcription layer that translates the output of the sequence model into the final label sequence. KerasOCR [19] is a widely used end-to-end OCR pipeline for scene text detection and recognition applications. The text recognition module in this pipeline uses the CRNN-CTC architecture. The CRNN model in KerasOCR takes a word-level bounding box produced by the text detection module as input. This image is processed by the CNN, which generates a sequence of feature maps over m timesteps. KerasOCR then employs a bidirectional long short-term memory (Bi-LSTM) sequence model to process these feature maps. The Bi-LSTM outputs a sequence of softmax probabilities over a given vocabulary at each timestep. In the transcription layer, a CTC loss function [47] is applied to decode the Bi-LSTM output and recognize the final text sequence within the bounding box.
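For concreteness, the following Keras sketch shows the three-part CRNN structure described above (convolutional features, Bi-LSTM sequence model, per-timestep softmax decoded with CTC). Layer counts and sizes are illustrative assumptions, not the exact KerasOCR configuration.

```python
# Compact CRNN sketch: CNN feature extractor -> Bi-LSTM -> per-timestep softmax.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
NUM_CLASSES = len(VOCAB) + 1   # extra class for the CTC blank symbol

def build_crnn(height=32, width=200):
    inputs = layers.Input(shape=(height, width, 1))
    # (1) Convolutional feature extractor
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)                       # -> (height/4, width/4, 128)
    # Treat each image column as one timestep of the sequence model
    x = layers.Permute((2, 1, 3))(x)                    # -> (width/4, height/4, 128)
    x = layers.Reshape((width // 4, (height // 4) * 128))(x)
    # (2) Bi-LSTM sequence model
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
    # (3) Per-timestep character probabilities; a CTC loss/decoder maps these
    #     softmax sequences to the final label sequence
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_crnn()
model.summary()
```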

3.3.2. Transformer-Based Architecture

Many OCR approaches use CNN models for image understanding and a sequence model for character-level text recognition. Additionally, language models are typically employed as a post-processing step to enhance the overall accuracy of the OCR output. In contrast, TrOCR [18], a transformer-based architecture proposed by Microsoft, leverages visual transformers for image understanding and text transformers for text generation. It follows an encoder–decoder architecture that combines the following components (Figure 7): (1) a pre-trained bidirectional encoder representation from image transformers (BEiT) [48] model for the encoder, and (2) a pre-trained RoBERTa [49] model (an optimized BERT language model) for the decoder. The model uses the attention mechanism of transformers to focus on relevant parts of the image, improving OCR output accuracy.
An input image is decomposed into a sequence of patches, and each patch is linearly projected into a D-dimensional vector and processed by the BEiT encoder. Subsequently, the decoder uses a pre-trained RoBERTa model to recognize the input text by generating a sequence of text pieces from the visual features produced by the encoder. TrOCR was pre-trained in two stages. In the first stage, the model was trained on a large labeled dataset of 684 M printed text-line images cropped from publicly available PDF files. In the second stage, it was trained on smaller synthetic datasets for downstream tasks such as handwritten text recognition and receipt text recognition. The first dataset consists of 17.9 M handwritten text-line images generated from 5427 handwritten fonts, while the second dataset contains 1 M synthetic text-line images created using fonts from a set of receipt images [18]. Pre-trained TrOCR models from the different stages have been publicly released.
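The released checkpoints can be run directly through the Hugging Face transformers API, as in the minimal sketch below; the file name is a placeholder for a cropped word or text-line image, and this is not the exact inference code used in our experiments.

```python
# Minimal TrOCR inference sketch with Hugging Face transformers.
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-small-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-small-printed")

image = Image.open("word_crop.png").convert("RGB")                         # placeholder crop
pixel_values = processor(images=image, return_tensors="pt").pixel_values   # patch embeddings input
generated_ids = model.generate(pixel_values)                               # autoregressive decoding
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
```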

3.3.3. Transfer Learning Through Fine-Tuning

The performance of pre-trained OCR models is highly dependent on the quality of input data. In our case, these models struggled with the diverse textual styles and fonts in the engineering documents, which are not easily recognizable by models trained on standard public datasets. Transfer learning [50] (TL) allows a model trained on one task ( T s ) to be adapted for another task ( T t ), leveraging knowledge from a source domain ( D s ) to improve performance in a target domain ( D t ), particularly when labeled data in the target domain are limited. Fine-tuning is a common TL approach, where a pre-trained model is further trained on a new dataset specific to the target task. In this study, we fine-tuned the pre-trained KerasOCR (CRNN architecture) and TrOCR (transformer-based architecture) models using an annotated dataset from our case study on ships’ engineering documents.
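A condensed sketch of such fine-tuning for a TrOCR-style model is shown below. The data loader, file name, and label are placeholders; the AdamW optimizer and 5e-5 learning rate mirror the settings reported later in Section 5.3, and this is an illustration rather than the full training script.

```python
# Sketch of fine-tuning a pre-trained TrOCR checkpoint on annotated word images.
import torch
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-small-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-small-printed").train()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Placeholder "loader": batches of (list of word images, list of transcriptions).
train_loader = [([Image.open("word_crop.png").convert("RGB")], ["VALVE"])]

for images, texts in train_loader:
    pixel_values = processor(images=images, return_tensors="pt").pixel_values
    labels = processor.tokenizer(texts, padding=True, return_tensors="pt").input_ids
    labels[labels == processor.tokenizer.pad_token_id] = -100    # ignore padding in the loss
    loss = model(pixel_values=pixel_values, labels=labels).loss  # cross-entropy over text tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```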

3.4. Competing Methods

We compared the proposed (fine-tuned) models with several pre-trained OCR models.

3.4.1. Pre-Trained Tesseract

Tesseract [30] is one of the most widely used OCR engines for various applications. The latest versions of Tesseract (4.0 and beyond) have integrated a long short-term memory (LSTM) neural network to improve text recognition accuracy. Tesseract performs text recognition through several stages: (1) preprocessing, (2) segmentation, (3) feature extraction, and (4) classification [51]. In the preprocessing step, Tesseract uses adaptive thresholding (Otsu’s method) to convert the input document into a binary format. It then performs page layout analysis to segment the input document (e.g., a book page) into text blocks, followed by detecting text lines within these blocks. In the final stage, Tesseract employs a deep learning framework (LSTM) to recognize text within the detected lines. The Tesseract model was pre-trained on a dataset of 400 K text lines, encompassing 4500 different fonts.
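A typical way to invoke pre-trained Tesseract from Python is through the pytesseract wrapper, as in the short example below; the file name is a placeholder and the engine/segmentation flags are illustrative choices rather than the exact settings used in our comparison.

```python
# Running pre-trained Tesseract via pytesseract (LSTM engine selected with --oem 1).
import cv2
import pytesseract

image = cv2.imread("msc_page.png", cv2.IMREAD_GRAYSCALE)   # placeholder file name
text = pytesseract.image_to_string(image, config="--oem 1 --psm 6")
print(text)
```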

3.4.2. Pre-Trained EasyOCR

EasyOCR [27] is a Python-based OCR pipeline primarily designed for scene text detection and recognition applications. EasyOCR uses the CRAFT algorithm [25] to detect text within input images and CRNN for text recognition. It leverages the PyTorch library and OpenCV on the back end. The text recognition module in EasyOCR was pre-trained on a hybrid dataset comprising 800 K natural scene images augmented by 9 M randomly generated synthetic images.
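Using the pre-trained EasyOCR pipeline requires only a few lines, as sketched below; each result is a (bounding box, text, confidence) tuple, and the file name is a placeholder.

```python
# Running the pre-trained EasyOCR pipeline (CRAFT detection + CRNN recognition).
import easyocr

reader = easyocr.Reader(["en"])              # downloads detection/recognition weights
results = reader.readtext("msc_page.png")    # placeholder file name
for box, text, confidence in results:
    print(f"{text}  ({confidence:.2f})")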

3.4.3. Pre-Trained KerasOCR

KerasOCR is also a Python-based OCR package for scene text detection and recognition applications. KerasOCR includes an end-to-end training pipeline to build OCR models. Like EasyOCR, this model implements CRAFT [25] for text detection and CRNN for text recognition. The KerasOCR text recognition model was pre-trained on 90 K synthetic scene images.
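The end-to-end pre-trained KerasOCR pipeline can be exercised as in the sketch below; recognize() returns, per image, a list of (word, box) pairs, and the file name is a placeholder.

```python
# Running the pre-trained KerasOCR pipeline (CRAFT + CRNN).
import keras_ocr

pipeline = keras_ocr.pipeline.Pipeline()
images = [keras_ocr.tools.read("msc_page.png")]   # placeholder file name
predictions = pipeline.recognize(images)
for word, box in predictions[0]:
    print(word)
```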

3.4.4. Pre-Trained TrOCR

We used TrOCR models, initially trained on a large collection of printed text line images (first stage) and then on the SROIE (Scanned Receipts OCR and Information Extraction) dataset [52] (second stage). We experimented with two pre-trained variants of TrOCR:
  • Pre-trained TrOCR small (https://huggingface.co/microsoft/trocr-small-printed, accessed on 20 June 2024): This variant comprises 62 M parameters and employs the DeiT-small visual transformer [53] (12 layers with 384 hidden sizes and 6 attention heads) for the encoder architecture. Furthermore, the MiniLM [54] transformer (a lightweight language model released by Microsoft consisting of 6 layers, 256 hidden sizes, and 8 attention heads) is used for the decoder architecture.
  • Pre-trained TrOCR large (https://huggingface.co/microsoft/trocr-large-printed, accessed on 20 June 2024): This variant comprises 558 M parameters and employs the BEiT-large visual transformer [48] (24 layers, 1024 hidden sizes, and 16 attention heads) for the encoder architecture. A large RoBERTa transformer [49] (12 layers, 1024 hidden sizes, and 16 attention heads) is used for the decoder architecture.

3.5. Evaluation Metrics

To evaluate the performance of each model, we used the following metrics: (1) training time per epoch ($time_{epoch}$), (2) Character Error Rate ($CER$), (3) Word Error Rate ($WER$), and (4) Levenshtein distance ($Lev$). The training time per epoch records the time taken to train each OCR model for one epoch, defined as the ratio of the total training time to the number of epochs,
$time_{epoch} = time_{total} / n_{epochs}$ (2)
$CER$ calculates the percentage of incorrectly identified characters relative to the total number of characters,
$CER = (n_{errors} / n_{characters}) \times 100$ (3)
$WER$ calculates the percentage of incorrectly identified words relative to the total number of words,
$WER = (n_{errors} / n_{words}) \times 100$ (4)
The Levenshtein distance between a pair of words measures the smallest number of edits (insertions, deletions, or substitutions) needed to convert one word into another. We calculated the average Levenshtein distance for all predicted and actual word pairs in the test dataset, as shown in Equation (5).
$AVR.Lev = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} Lev(w_i^{pred}, w_i^{actual})$ (5)
where $N_{test}$ is the size of the testing set, and $w_i^{pred}$ and $w_i^{actual}$ represent the OCR prediction and its corresponding ground-truth word, respectively.
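The self-contained helpers below illustrate one common way to compute these metrics; counting character errors via edit distance is an implementation assumption (the paper does not specify the exact tool), and the sample strings are placeholders.

```python
# Illustrative implementations of CER, WER, and average Levenshtein distance.
def levenshtein(a: str, b: str) -> int:
    # Dynamic-programming edit distance (insertions, deletions, substitutions).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(preds, actuals):
    errors = sum(levenshtein(p, a) for p, a in zip(preds, actuals))
    return 100 * errors / sum(len(a) for a in actuals)

def wer(preds, actuals):
    # Each sample is one word image, so a word is wrong if the strings differ.
    return 100 * sum(p != a for p, a in zip(preds, actuals)) / len(actuals)

def avg_lev(preds, actuals):
    return sum(levenshtein(p, a) for p, a in zip(preds, actuals)) / len(actuals)

print(cer(["B0LT"], ["BOLT"]), wer(["B0LT"], ["BOLT"]), avg_lev(["B0LT"], ["BOLT"]))
```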

4. Experiment Setup

4.1. Dataset

We created two datasets from engineering documents from the shipyard and aviation industries. The first dataset was manually annotated from nine ships’ maintenance documents provided by MSC. These documents include 2D ship drawings and textual content that describes important information about the design and maintenance of the ships. Each document contains hundreds of lines representing the various textual styles and fonts used in ships’ maintenance documentation. The dataset includes both low-quality and high-quality documents. Figure 1, Figure 2 and Figure 3 show three examples extracted from MSC engineering documents. The text font in the example in Figure 1 is relatively easy to recognize, while those in Figure 2 are more challenging. The example shown in Figure 3 contains multiple distinct fonts. The second dataset was collected from aircraft design engineering drawings available through AirCorps Library (https://app.aircorpslibrary.com/about, accessed on 22 October 2024), which is an online repository that hosts engineering drawings and technical manuals for WWII and legacy aircraft. An example of an AirCorps document is illustrated in Figure 8. Although this dataset comes from a different industry, its textual styles are similar to those found in the MSC documentation.

4.2. Data Annotation

These documents also include a variety of textual styles and, at times, unrecognized fonts, which pose significant challenges for OCR systems to achieve high recognition accuracy. We manually annotated two datasets of word-level images cropped from the MSC engineering documents and AirCorps library documents. Figure 9 shows a few examples of the annotated MSC dataset, while Figure 10 shows a few examples from AirCorps documents. The MSC dataset consists of 4671 word images cropped from different sections of the engineering documents, encompassing various fonts and styles commonly used in ships’ maintenance documentation. We used this dataset to evaluate the performance of our proposed OCR models in recognizing text with various fonts. The dataset was randomly divided into a training set (80%) and a testing set (20%), with 3734 word images used for training and 937 for testing. The AirCorps dataset consists of 1004 annotated word images. This dataset was randomly split into a training set (80%) and a testing set (20%), with 803 word images used for training and 201 for testing.
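A minimal sketch of the 80/20 random split described above is given below; scikit-learn, the seed, and the placeholder file names are assumptions, since the paper does not state which tool was used for splitting.

```python
# Sketch of an 80/20 random split of annotated word images.
from sklearn.model_selection import train_test_split

word_images = [f"crop_{i:04d}.png" for i in range(4671)]   # placeholder file names
labels = ["WORD"] * len(word_images)                       # placeholder transcriptions

train_imgs, test_imgs, train_lbls, test_lbls = train_test_split(
    word_images, labels, test_size=0.20, random_state=0)
print(len(train_imgs), len(test_imgs))
```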

4.3. Experiments

We conducted three experiments in this study. In the first experiment, we compared the performance of fine-tuned TrOCR and KerasOCR models with a set of pre-trained OCR models (pre-trained KerasOCR, TrOCR, EasyOCR, and Tesseract) on MSC and AirCorps datasets. The goal was to evaluate the overall OCR performance of the transfer learning approach applied to the MSC and AirCorps documents. In the second experiment, we trained the ScrabbleGAN model to augment our training dataset, and in the third experiment, we assessed the impact of the data augmentation procedure on OCR performance. We created two new datasets that included both the augmented word images and the manually annotated samples to fine-tune the KerasOCR and TrOCR models. This experiment aimed to evaluate the effect of the data augmentation technique introduced in this research on OCR performance for MSC and AirCorps documents.

5. Experiment Results

5.1. Results by Pre-Trained Models on MSC Data

Table 1 reports the text recognition performance measured using the WER, CER, and Levenshtein distance metrics on MSC data. As expected, most of the pre-trained OCR systems did not perform well on our data. The pre-trained Tesseract system had the highest character and word error rates (13.14% and 32.34%, respectively), while the pre-trained TrOCR system (large model) achieved the lowest character and word error rates (3.52% and 12.47%, respectively) among the pre-trained models.

5.2. Results of Data Augmentation Using MSC Documents

We trained ScrabbleGAN to generate synthetic training word images that replicate the textual styles used in the provided engineering documents. All annotated data were used to train ScrabbleGAN. Training images were resized to 32 × 16n pixels (with the height set to 32 pixels and the width varying according to the number of characters, n, in the word). We trained the model using the default training parameters from the original model [45], except for the batch size, which was set to 10, and the learning rate, set to 0.0002 for all networks (R, G, and D). ScrabbleGAN was trained to produce uppercase English characters only. We used the trained generator G in the inference step to create 3000 synthetic word images. Figure 11 shows a few word-image samples from the synthetic data after 1000 epochs.
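The 32 × 16n resizing rule can be expressed as in the short sketch below; using OpenCV and the example word are implementation assumptions for illustration.

```python
# Sketch of the 32 x 16n resizing rule for ScrabbleGAN training images.
import cv2

def resize_for_scrabblegan(word_image, word):
    n = len(word)                        # number of characters in the word
    return cv2.resize(word_image, (16 * n, 32), interpolation=cv2.INTER_AREA)

img = cv2.imread("word_crop.png", cv2.IMREAD_GRAYSCALE)   # placeholder path
resized = resize_for_scrabblegan(img, "VALVE")             # -> 32 x 80 pixels
```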

5.3. Results by Fine-Tuned Models on MSC Data

For the evaluation of the different fine-tuned OCR models, we used 20% of the annotated MSC data (937 word images from MSC documents) for testing. Then, we created these fine-tuning datasets:
  • MSC fine-tuning dataset (3734 word images): This dataset includes the remaining 80% of the annotated set cropped from MSC data.
  • Augmented MSC fine-tuning dataset (6734 word images): This dataset includes the 3000 synthetic word images produced by ScrabbleGAN in addition to the MSC fine-tuning dataset.
KerasOCR was fine-tuned using a batch size of 32, the Adam optimizer with a learning rate of 0.001, and trained for 50 epochs (with early stopping after 20 epochs). The same parameters were used to fine-tune KerasOCR on the augmented training set. The small TrOCR model was fine-tuned using the AdamW optimizer (https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html, accessed on 20 December 2024) with a learning rate of 0.00005, a batch size of 32, and 20 epochs. The large TrOCR model was fine-tuned using the AdamW optimizer with a learning rate of 0.00005, a batch size of 16, and 10 epochs. All the models were trained using an Nvidia A100 high-memory GPU provided by Old Dominion University (https://wiki.hpc.odu.edu/en/open-ondemand, accessed on 19 December 2024).
Results in Table 1 show that fine-tuning the KerasOCR model significantly improved its performance. The word error rate (WER) was reduced from 21.35% to 11.21% and the character error rate (CER) from 6.65% to 3.17% after fine-tuning with the MSC training dataset, and these further dropped to 7.9% and 2.55%, respectively, after fine-tuning with the augmented dataset. Fine-tuning the TrOCR (small) model with the MSC dataset improved the WER from 20.89% to 7.3% and the CER from 5.54% to 1.71%. After fine-tuning with the augmented dataset, the TrOCR (small) model continued to improve its WER (6.18%) and CER (1.65%). In contrast, fine-tuning the TrOCR (large) model with the MSC dataset improved WER and CER from 12.47% and 3.52% to 4.37% and 1.30%, respectively. However, fine-tuning the model with the augmented dataset also improved its performance, though not as much as the MSC dataset. The large TrOCR model has significantly more parameters, and we hypothesize that it will require higher-quality data to achieve better results. Overall, the transformer-based TrOCR (large), fine-tuned with the augmented MSC dataset, achieved the best results with a WER of 3.94% and a CER of 0.09%. Transformer-based models require more time for fine-tuning; since fine-tuning is a one-time task, it does not present a significant burden in practice.

5.4. Results by Fine-Tuned Models on AirCorps Data

We used 20% of the annotated AirCorps data (201 word images from AirCorps library documents) for testing and created these fine-tuning datasets:
  • AirCorps fine-tuning dataset (803 word images): This dataset includes the remaining 80% of the annotated set cropped from AirCorps library documents.
  • Augmented AirCorps fine-tuning dataset (1802 word images): This dataset includes 1000 synthetic word images generated by the ScrabbleGAN model trained with the MSC dataset in addition to the AirCorps fine-tuning dataset.
Note that we attempted to use the AirCorps fine-tuning dataset to train the ScrabbleGAN model for augmentation. However, the synthetic images were of very low quality, partially because the AirCorps fine-tuning dataset contains only 802 images, which is insufficient to effectively train the ScrabbleGAN model. As an alternative, we randomly selected 1000 synthetic images generated by the ScrabbleGAN model trained on the MSC fine-tuning dataset to fine-tune the OCR models for recognizing the AirCorps dataset. We did not use all 3000 synthetic images because doing so did not result in better outcomes.
The fine-tuning results for each OCR model on the AirCorps dataset are reported in Table 2. The table shows that fine-tuning the large variant of TrOCR with the augmented dataset improved performance by 4% in terms of word error rate. The synthetic training samples generated by the ScrabbleGAN model, trained on MSC data, enhanced OCR performance on the AirCorps dataset. This enhanced performance can be attributed to the domain similarity, as both datasets consist of historical engineering drawings from different industries. However, it should be noted that the smaller size of the AirCorps dataset resulted in higher word and character error rates compared to the MSC dataset. While the augmentation procedure has demonstrated its effectiveness in improving OCR performance with limited data availability, the findings suggest that larger annotated datasets remain essential for achieving optimal OCR results across different domains.

5.5. Case Study for OCR on MSC Documents

Figure 12, Figure 13 and Figure 14 show one high-quality example and two challenging examples of text detection and recognition tasks by different models. For the high-quality example in Figure 13, all models performed very well, with no missed words by any of the models, and the fine-tuned KerasOCR models even achieved a 0% WER (all words recognized correctly). The font used in this example was easy to recognize, with a clean background and no curved or irregular text, resulting in a low recognition error rate.
Figure 12 illustrates an example selected from a noisy MSC document. Overall, the fine-tuned TrOCR models provided better performance than the KerasOCR models. In some instances where the bounding box included more than one word (e.g., the word ‘PAINTING/COATING’), the fine-tuned TrOCR models failed to capture the complete text, unlike the pre-trained TrOCR model. This is due to the training strategy used in this research, which involved labeled word images, whereas the pre-trained TrOCR model was trained on text-line images. Moreover, the large TrOCR model was more robust to the noisy background than both the KerasOCR and small TrOCR models. For instance, the large TrOCR model accurately recognized the words ‘ALL’ and ‘NOTED,’ which were missed by other models. Future research will explore generative deep learning approaches for image enhancement. We expect that image enhancement and denoising techniques will further improve OCR performance on noisy documents.
Figure 14 presents another challenging example featuring an irregular and hard-to-recognize font. Both the fine-tuned and pre-trained KerasOCR models had low accuracy rates in this case. However, the fine-tuned TrOCR models were able to recognize several words, such as ‘BLEEDER’, ‘STEEL’, and ‘PLUG’. The large TrOCR model performed slightly better than the small model. Similar to the first example, the fine-tuned TrOCR models struggled with numerals. Although the large TrOCR model successfully identified some numerals, such as ‘101-001’, it failed to accurately recognize others, such as ‘9/16”’ and ‘1/2.’’ The proposed models also failed to recognize degraded words, such as ‘BLEEDER’ (top left). Degraded text is a common issue in many low-quality MSC documents. The current OCR solutions are inadequate for detecting such text, highlighting the need for further research in this area.

6. Discussion

In this study, we fine-tuned several pre-trained OCR models using an annotated dataset derived from MSC ships’ engineering documents. These documents are unique in that they contain text written in various styles (both handwritten and printed) and different fonts. The pre-trained OCR models evaluated in this study did not perform well in our application, primarily because they were generally trained on public benchmark datasets and are sensitive to the specific datasets used for training. Fine-tuning both KerasOCR and TrOCR improved OCR performance. We reduced the word error rate from 12.47% with the pre-trained large TrOCR model (trained on the SROIE [52] dataset) to 4.37% with the fine-tuned large TrOCR model and 3.94% with the fine-tuned large TrOCR model with the augmented dataset. This improvement was even more significant with KerasOCR, reducing the word error rate by 10% and the character error rate by 3% compared to its pre-trained variant. Overall, we found that the transformer-based model (TrOCR) provides better OCR output compared to the CRNN-based model (KerasOCR). The pre-trained large version of TrOCR notably achieved better results than the other pre-trained models. This performance may be attributed to the pre-training dataset used for TrOCR, which was derived from the SROIE (Scanned Receipts OCR and Information Extraction) task. The proposed augmentation procedure was effective in improving the OCR performance of KerasOCR and TrOCR models on MSC data.
By analyzing several instances from MSC documents, we observed that OCR models using the CRNN architecture, such as KerasOCR, are more sensitive to the quality of input documents, while TrOCR demonstrated greater robustness to noise, unfamiliar fonts, and irregular text styles, even with fewer training epochs. However, TrOCR’s improved performance came with significant drawbacks. The training process for TrOCR, particularly the large model, was much more time-consuming and required specialized computing resources like high-memory GPUs, with KerasOCR being approximately 71 times faster. Despite TrOCR’s strong performance in detecting a wide range of fonts and handling noisy backgrounds, it still had limitations. The fine-tuned large TrOCR model struggled with identifying special characters, numerals, and degraded text, highlighting the need for further improvements in these areas. To address this challenge, data augmentation techniques can be employed to generate synthetic training samples, including a variety of numerals and special characters with diverse fonts, orientations, and noise levels. Additionally, it can be addressed by employing multi-task learning, where the OCR model is trained to jointly optimize OCR performance with secondary tasks, such as numeral or symbol classification.
Pre-trained models, such as TrOCR, provide a strong starting point for OCR tasks. However, their effectiveness is limited in domain-specific environments, such as those with unique font styles, as they are primarily pre-trained on generalized OCR tasks. Training an OCR model from scratch is a resource-intensive process that requires substantial amounts of annotated data. To address these challenges, several alternative approaches can be employed. For example, integrating active learning strategies, where the model is iteratively fine-tuned on high-uncertainty samples identified during inference, can improve its adaptability to unseen domains. Another promising approach is meta-learning (learning to learn), where the model learns from multiple tasks (e.g., engineering documents from various industries) and generalizes knowledge across domains. Meta-learning involves training at a higher level, enabling the model to adapt quickly to new tasks (domains).
Practical deployment of the proposed OCR systems presents significant challenges. As shown in our results, the transformer-based OCR model (TrOCR) incurs a higher computational cost compared to the traditional CRNN-based model (KerasOCR), primarily due to its larger number of parameters. While CRNN models are faster to retrain and simpler to deploy, they exhibit a higher error rate. To address the computational demands of TrOCR in real-world settings, several strategies can be employed. One effective approach is knowledge distillation, which transfers knowledge from a large, pre-trained “teacher” model to a smaller, more efficient “student” model [1]. This technique enables the student model to achieve accuracy comparable to the original TrOCR model while significantly reducing computational costs. Another strategy is partial fine-tuning, where only a subset of the TrOCR model’s layers—particularly the final layers—are fine-tuned, rather than retraining the entire model. This targeted fine-tuning minimizes resource requirements while effectively adapting the model to specific deployment contexts.
Finally, the limited availability of engineering documents poses a significant challenge for training the OCR models. Capturing the diversity of fonts and textual styles in our application requires more data, but collecting and annotating such data is time-consuming and resource-intensive, making it impractical for similar digitization efforts. To address this issue of data scarcity, we implemented a generative deep learning model, ScrabbleGAN, for data augmentation. This model was trained to generate synthetic images by replicating the fonts used in the original documents, providing a partial solution to the limited dataset problem. While this approach significantly improved the performance of the KerasOCR model, it had much less effect on the fine-tuned TrOCR model, partially because the transformer-based model already has very good results. We hypothesize that the large transformer model, with its significantly higher number of parameters, requires more high-quality data for effective fine-tuning. We will continue to explore new methods to increase the availability of data for fine-tuning. The variability of documents across industries (e.g., shipbuilding vs. automobile) underscores the necessity of adapting pre-trained models to unseen data, highlighting the importance of continued research to enhance OCR models.

7. Conclusions

In this study, we proposed a deep learning OCR pipeline that integrates transformers and generative models for text recognition, data augmentation, and image enhancement. Using a dataset of annotated engineering documents provided by MSC, we fine-tuned the text recognition module within our pipeline. The fine-tuning process significantly improved the OCR performance of both the pre-trained CRNN and transformer-based models, with the fine-tuned transformer model outperforming all competitors and demonstrating substantial gains. As our future work, we plan to explore cutting-edge deep generative models, such as diffusion models, to enhance document cleaning and reduce background noise. Although the augmentation procedure slightly improved the performance of the fine-tuned KerasOCR model, it had minimal impact on the transformer-based model. Future research will focus on refining our data augmentation techniques to generate more realistic and effective training images.

Author Contributions

Conceptualization, J.L., A.S.-P. and S.K.; methodology, W.K. and J.L.; software, M.S.U. and W.K.; validation, J.L. and W.K.; formal analysis, W.K. and J.L.; investigation, W.K., J.L., M.S.U. and S.K.; resources, S.K. and A.S.-P.; data curation, W.K., J.L. and M.S.U.; writing—original draft preparation, W.K.; writing—review and editing, W.K. and J.L.; visualization, W.K. and J.L.; supervision, A.S.-P., S.K. and J.L.; project administration, S.K. and J.L.; funding acquisition, S.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research is based upon work supported, in whole or in part, by the U.S. Navy’s Military Sealift Command through CACI under sub-contract P000143798-3 and project 500481-003.

Data Availability Statement

Data are not publicly available due to privacy restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Henderson, K.; Salado, A. Value and benefits of model-based systems engineering (MBSE): Evidence from the literature. Syst. Eng. 2021, 24, 51–66. [Google Scholar] [CrossRef]
  2. Shevchenko, N. An Introduction to Model-Based Systems Engineering (MBSE); Carnegie Mellon University, Software Engineering Institute’s Insights (blog): Pittsburgh, PA, USA, 2020. [Google Scholar]
  3. Department of Defense. Digital Engineering Strategy; Office of the Deputy Assistant Secretary of Defense for Systems Engineering: Washington, DC, USA, 2018.
  4. Lin, Y.H.; Ting, Y.H.; Huang, Y.C.; Cheng, K.L.; Jong, W.R. Integration of Deep Learning for Automatic Recognition of 2D Engineering Drawings. Machines 2023, 11, 802. [Google Scholar] [CrossRef]
  5. Memon, J.; Sami, M.; Khan, R.A.; Uddin, M. Handwritten optical character recognition (OCR): A comprehensive systematic literature review (SLR). IEEE Access 2020, 8, 142642–142668. [Google Scholar] [CrossRef]
  6. Kumar, P.; Revathy, S. An automated invoice handling method using OCR. In Data Intelligence and Cognitive Informatics: Proceedings of ICDICI 2020; Springer: Berlin/Heidelberg, Germany, 2021; pp. 243–254. [Google Scholar]
  7. Yindumathi, K.; Chaudhari, S.S.; Aparna, R. Analysis of image classification for text extraction from bills and invoices. In Proceedings of the 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Kharagpur, India, 1–3 July 2020; pp. 1–6. [Google Scholar]
  8. Shambharkar, Y.; Salagrama, S.; Sharma, K.; Mishra, O.; Parashar, D. An automatic framework for number plate detection using ocr and deep learning approach. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 8–14. [Google Scholar] [CrossRef]
  9. Vedhaviyassh, D.; Sudhan, R.; Saranya, G.; Safa, M.; Arun, D. Comparative analysis of easyocr and tesseractocr for automatic license plate recognition using deep learning algorithm. In Proceedings of the 2022 6th International Conference on Electronics, Communication and Aerospace Technology, Coimbatore, India, 1–3 December 2022; pp. 966–971. [Google Scholar]
  10. Shashidhar, R.; Manjunath, A.; Kumar, R.S.; Roopa, M.; Puneeth, S. Vehicle number plate detection and recognition using yolo-v3 and ocr method. In Proceedings of the 2021 IEEE International Conference on Mobile Networks and Wireless Communications (ICMNWC), Tumkur, India, 3–4 December 2021; pp. 1–5. [Google Scholar]
  11. Yu, D.; Li, X.; Zhang, C.; Liu, T.; Han, J.; Liu, J.; Ding, E. Towards accurate scene text recognition with semantic reasoning networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12113–12122. [Google Scholar]
  12. Huang, M.; Liu, Y.; Peng, Z.; Liu, C.; Lin, D.; Zhu, S.; Yuan, N.; Ding, K.; Jin, L. Swintextspotter: Scene text spotting via better synergy between text detection and text recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4593–4603. [Google Scholar]
  13. Ye, J.; Chen, Z.; Liu, J.; Du, B. TextFuseNet: Scene Text Detection with Richer Fused Features. IJCAI 2020, 20, 516–522. [Google Scholar]
  14. Long, S.; He, X.; Yao, C. Scene text detection and recognition: The deep learning era. Int. J. Comput. Vis. 2021, 129, 161–184. [Google Scholar] [CrossRef]
  15. He, M.; Liao, M.; Yang, Z.; Zhong, H.; Tang, J.; Cheng, W.; Yao, C.; Wang, Y.; Bai, X. MOST: A multi-oriented scene text detector with localization refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8813–8822. [Google Scholar]
  16. AVMAC LLC. AVMAC Awarded Military Sealift Command Subcontract. 2012. Available online: https://avmacllc.com/avmac-awarded-military-sealift-command-subcontract/ (accessed on 19 December 2024).
  17. Khallouli, W.; Pamie-George, R.; Kovacic, S.; Sousa-Poza, A.; Canan, M.; Li, J. Leveraging transfer learning and GAN models for OCR from engineering documents. In Proceedings of the 2022 IEEE World AI IoT Congress (AIIoT), Seattle, WA, USA, 6–9 June 2022; pp. 15–21. [Google Scholar]
  18. Li, M.; Lv, T.; Chen, J.; Cui, L.; Lu, Y.; Florencio, D.; Zhang, C.; Li, Z.; Wei, F. Trocr: Transformer-based optical character recognition with pre-trained models. AAAI Conf. Artif. Intell. 2023, 37, 13094–13102. [Google Scholar] [CrossRef]
  19. Kerasocr. Available online: https://keras-ocr.readthedocs.io/en/latest/ (accessed on 19 December 2024).
  20. Ranjan, A.; Behera, V.N.J.; Reza, M. Ocr using computer vision and machine learning. Mach. Learn. Alg. Ind. Appl. 2021, 2021, 83–105. [Google Scholar]
  21. Philips, J.; Tabrizi, N. Historical Document Processing: A Survey of Techniques, Tools, and Trends. KDIR 2020, 2020, 341–349. [Google Scholar]
  22. Lubna; Mufti, N.; Shah, S.A.A. Automatic number plate Recognition: A detailed survey of relevant algorithms. Sensors 2021, 21, 3028. [Google Scholar] [CrossRef] [PubMed]
  23. Antonio, J.; Putra, A.R.; Abdurrohman, H.; Tsalasa, M.S. A Survey on Scanned Receipts OCR and Information Extraction. In Proceedings of the International Conference on Document Analysis and Recognition, Jerusalem, Israel, 29–30 November 2022; pp. 29–30. [Google Scholar]
  24. Zhou, X.; Yao, C.; Wen, H.; Wang, Y.; Zhou, S.; He, W.; Liang, J. East: An efficient and accurate scene text detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5551–5560. [Google Scholar]
  25. Baek, Y.; Lee, B.; Han, D.; Yun, S.; Lee, H. Character region awareness for text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9365–9374. [Google Scholar]
  26. Shi, B.; Bai, X.; Yao, C. An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2298–2304. [Google Scholar] [CrossRef] [PubMed]
  27. EasyOCR. Available online: https://github.com/JaidedAI/EasyOCR (accessed on 25 September 2024).
  28. Liao, M.; Shi, B.; Bai, X.; Wang, X.; Liu, W. TextBoxes: A Fast Text Detector with a Single Deep Neural Network. In Proceedings of the AAAI, San Francisco, CA USA, 4–9 February 2017. [Google Scholar]
  29. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Part I 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  30. Tesseract. Available online: https://github.com/tesseract-ocr/tesseract (accessed on 19 December 2024).
  31. GoogleCloud. Detect Text in Images. Available online: https://cloud.google.com/vision/docs/ocr (accessed on 19 December 2024).
  32. docTR. docTR: Document Text Recognition. Available online: https://mindee.github.io/doctr/ (accessed on 19 December 2024).
  33. Breuel, T.M. The OCRopus open source OCR system. In Proceedings of the Document Recognition and Retrieval XV, San Jose, CA, USA, 30–31 January 2008; International Society for Optics and Photonics: Bellingham, WA, USA, 2008; Volume 6815, p. 68150F. [Google Scholar]
  34. Sanyam. PaddleOCR: Unveiling the Power of Optical Character Recognition. 2022. Available online: https://learnopencv.com/optical-character-recognition-using-paddleocr/ (accessed on 19 December 2024).
  35. Puigcerver, J. Are multidimensional recurrent layers really necessary for handwritten text recognition? In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; Volume 1, pp. 67–72. [Google Scholar]
  36. de Sousa Neto, A.F.; Bezerra, B.L.D.; Toselli, A.H.; Lima, E.B. HTR-Flor: A deep learning system for offline handwritten text recognition. In Proceedings of the 2020 33rd SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), Porto de Galinhas, Brazil, 7–10 November 2020; pp. 54–61. [Google Scholar]
  37. Kass, D.; Vats, E. AttentionHTR: Handwritten text recognition based on attention encoder-decoder networks. In International Workshop on Document Analysis Systems; Springer: Berlin/Heidelberg, Germany, 2022; pp. 507–522. [Google Scholar]
  38. Ly, N.T.; Nguyen, H.T.; Nakagawa, M. 2D self-attention convolutional recurrent network for offline handwritten text recognition. In International Conference on Document Analysis and Recognition; Springer: Berlin/Heidelberg, Germany, 2021; pp. 191–204. [Google Scholar]
  39. Fujitake, M. Dtrocr: Decoder-only transformer for optical character recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 8025–8035. [Google Scholar]
  40. Kim, G.; Hong, T.; Yim, M.; Nam, J.; Park, J.; Yim, J.; Hwang, W.; Yun, S.; Han, D.; Park, S. Ocr-free document understanding transformer. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 498–517. [Google Scholar]
  41. Villena Toro, J.; Wiberg, A.; Tarkian, M. Optical character recognition on engineering drawings to achieve automation in production quality control. Front. Manuf. Technol. 2023, 3, 1154132. [Google Scholar] [CrossRef]
  42. Saba, A.; Hantach, R.; Benslimane, M. Text Detection and Recognition from Piping and Instrumentation Diagrams. In Proceedings of the 2023 8th International Conference on Image, Vision and Computing (ICIVC), Dalian, China, 27–29 July 2023; pp. 711–716. [Google Scholar]
  43. Ren, Y.; Yao, H.; Liu, G.; Bai, Z. A text code recognition and positioning system for engineering drawings of nuclear power equipment. In Proceedings of the 2022 IEEE 6th Information Technology and Mechatronics Engineering Conference (ITOEC), Chongqing, China, 4–6 March 2022; Volume 6, pp. 661–665. [Google Scholar]
  44. Keras Implementation of Convolutional Recurrent Neural Network. Available online: https://github.com/janzd/CRNN (accessed on 19 December 2024).
  45. Fogel, S.; Averbuch-Elor, H.; Cohen, S.; Mazor, S.; Litman, R. Scrabblegan: Semi-supervised varying length handwritten text generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4324–4333. [Google Scholar]
  46. Yadav, A.; Singh, S.; Siddique, M.; Mehta, N.; Kotangale, A. OCR using CRNN: A Deep Learning Approach for Text Recognition. In Proceedings of the 2023 4th International Conference for Emerging Technology (INCET), Belgaum, India, 26–28 May 2023; pp. 1–6. [Google Scholar]
  47. Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 369–376. [Google Scholar]
  48. Bao, H.; Dong, L.; Piao, S.; Wei, F. BEiT: BERT pre-training of image transformers. arXiv 2021, arXiv:2106.08254. [Google Scholar]
  49. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  50. Weiss, K.; Khoshgoftaar, T.M.; Wang, D. A survey of transfer learning. J. Big Data 2016, 3, 1–40. [Google Scholar] [CrossRef]
  51. Akhil, S. An Overview of Tesseract OCR Engine; Seminar Report; Department of Computer Science and Engineering, National Institute of Technology: Calicut, India, 2016. [Google Scholar]
  52. Huang, Z.; Chen, K.; He, J.; Bai, X.; Karatzas, D.; Lu, S.; Jawahar, C. ICDAR2019 competition on scanned receipt OCR and information extraction. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia, 20–25 September 2019; pp. 1516–1520. [Google Scholar] [CrossRef]
  53. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
  54. Wang, W.; Wei, F.; Dong, L.; Bao, H.; Yang, N.; Zhou, M. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Adv. Neural Inf. Process. Syst. 2020, 33, 5776–5788. [Google Scholar]
Figure 1. An example of a high-quality MSC document with an easy-to-recognize font.
Figure 2. An example of an MSC document with a more challenging font to recognize.
Figure 3. An example of an MSC document with multiple fonts.
Figure 4. Framework of the proposed OCR pipeline. The MSC document first goes through a preprocessing step to enhance its quality. Subsequently, text detection and text recognition are performed on the enhanced document.
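To make the flow in Figure 4 concrete, the sketch below outlines the three stages in Python. Note that `enhance_image`, `detect_text_regions`, and `recognize_text` are hypothetical placeholders standing in for the preprocessing, detection, and recognition components described in the paper, not functions from the authors' codebase.

```python
# Minimal sketch of the Figure 4 pipeline; the three helper functions are
# hypothetical placeholders for the components described in the paper.
from PIL import Image

def run_ocr_pipeline(document_path):
    # 1. Preprocessing: enhance the scanned document (e.g., denoising, binarization).
    page = enhance_image(Image.open(document_path))

    # 2. Text detection: locate word-level bounding boxes on the enhanced page.
    boxes = detect_text_regions(page)

    # 3. Text recognition: transcribe each cropped word image.
    results = []
    for box in boxes:
        word_image = page.crop(box)      # box = (left, upper, right, lower)
        results.append((box, recognize_text(word_image)))
    return results
```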
Figure 5. ScrabbleGAN architecture [45]: The generator creates a text image of ‘DWG’ using trained character generators. The discriminator (D) checks if the image looks realistic and if the font is transferred accurately. The text recognition network (R) ensures readability.
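As a rough illustration of how the three networks in Figure 5 interact, the PyTorch-style sketch below shows one simplified generator update that combines the adversarial signal from D with the CTC readability signal from R. The network handles, the balancing weight `alpha`, and the loss form are assumptions made for illustration, not the exact ScrabbleGAN training recipe [45].

```python
import torch
import torch.nn.functional as F

# One simplified generator update for a ScrabbleGAN-style model. G, D, R, the
# optimizer, and the batch tensors are assumed to exist; alpha is an assumed
# weight that balances realism against readability.
def generator_step(G, D, R, opt_G, noise, text_labels, label_lengths, alpha=1.0):
    fake_images = G(noise, text_labels)          # render the requested string

    # Adversarial term: the discriminator should judge the fake image as real.
    adv_loss = -D(fake_images).mean()

    # Readability term: the recognizer should decode the intended text (CTC loss).
    log_probs = R(fake_images).log_softmax(dim=-1)          # shape (T, N, C)
    input_lengths = torch.full(
        (fake_images.size(0),), log_probs.size(0), dtype=torch.long
    )
    rec_loss = F.ctc_loss(log_probs, text_labels, input_lengths, label_lengths)

    loss = adv_loss + alpha * rec_loss
    opt_G.zero_grad()
    loss.backward()
    opt_G.step()
    return loss.item()
```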
Figure 6. Convolutional recurrent neural network for text recognition: it consists of convolutional layers for visual feature extraction from input images, recurrent layers for sequence modeling and character classification, and a transcription layer for converting the output of the sequence model into the final label sequence.
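A minimal Keras sketch of the CRNN structure in Figure 6, in the spirit of the implementation in [44], is given below. The layer sizes, input resolution, and 63-symbol alphabet are illustrative assumptions; the per-timestep softmax output would then be decoded by the transcription layer using CTC [47].

```python
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative CRNN recognizer: convolutional feature extraction, bidirectional
# recurrent sequence modeling, and a per-timestep character distribution.
def build_crnn(img_height=32, img_width=128, num_classes=63):
    inputs = keras.Input(shape=(img_height, img_width, 1))

    # Convolutional layers: visual features from the word image.
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 2))(x)           # -> (H/4, W/4, 128)

    # Treat the width axis as the time axis for the recurrent layers.
    x = layers.Permute((2, 1, 3))(x)             # -> (W/4, H/4, 128)
    x = layers.Reshape((img_width // 4, (img_height // 4) * 128))(x)

    # Recurrent layers: sequence modeling over horizontal feature slices.
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)

    # Per-timestep character probabilities; the extra class is the CTC blank.
    outputs = layers.Dense(num_classes + 1, activation="softmax")(x)
    return keras.Model(inputs, outputs)

model = build_crnn()
model.summary()
```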
Figure 7. TrOCR architecture [18]. The input image is split into a sequence of patches, each linearly projected into a D-dimensional embedding; a BEiT-initialized image Transformer encoder processes the patch sequence, and a RoBERTa-initialized text decoder generates the recognized text.
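For reference, a pre-trained TrOCR checkpoint can be run on a cropped word image with the Hugging Face transformers API roughly as follows. The checkpoint name and image path are placeholders, and this is not the fine-tuned model evaluated in this paper.

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Load a publicly available printed-text TrOCR checkpoint (illustrative choice).
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-large-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-large-printed")

# Recognize the text in a single cropped word image.
word_image = Image.open("word_crop.png").convert("RGB")
pixel_values = processor(images=word_image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
```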
Figure 8. An example of an aircraft engineering drawing provided by AirCorps.
Figure 9. Examples of annotated word images from MSC documents.
Figure 10. Examples of annotated word images from AirCorps documents.
Figure 11. Synthetic samples generated by ScrabbleGAN [45].
Figure 12. Results by OCR models on a selected challenging document. (a) Pre-trained KerasOCR: 19 ✓ vs. 9 ×. (b) Pre-trained TrOCR (large): 22 ✓ vs. 6 ×. (c) Fine-tuned KerasOCR with MSC: 3 ✓ vs. 25 ×. (d) Fine-tuned KerasOCR with Aug: 10 ✓ vs. 18 ×. (e) Fine-tuned TrOCR (small) with MSC: 19 ✓ vs. 9 ×. (f) Fine-tuned TrOCR (large) with MSC: 24 ✓ vs. 4 ×.
Figure 13. Results by OCR models on a selected high-quality MSC document. (a) Pre-trained KerasOCR: 14 ✓ vs. 3 ×. (b) Pre-trained TrOCR (large): 16 ✓ vs. 1 ×. (c) Fine-tuned KerasOCR with MSC: 16 ✓ vs. 1 ×. (d) Fine-tuned KerasOCR with Aug: 16 ✓ vs. 1 ×. (e) Fine-tuned TrOCR (small) with MSC: 16 ✓ vs. 1 ×. (f) Fine-tuned TrOCR (large) with MSC: 16 ✓ vs. 1 ×.
Figure 14. Results by OCR models on a selected challenging document featuring an irregular and hard-to-recognize font. (a) Pre-trained KerasOCR: 11 ✓ vs. 11 ×. (b) Pre-trained TrOCR (large): 17 ✓ vs. 5 ×. (c) Fine-tuned KerasOCR with MSC: 4 ✓ vs. 18 ×. (d) Fine-tuned KerasOCR with Aug: 3 ✓ vs. 19 ×. (e) Fine-tuned TrOCR (small) with MSC: 16 ✓ vs. 6 ×. (f) Fine-tuned TrOCR (large) with MSC: 17 ✓ vs. 5 ×.
Table 1. Results from the different models on the MSC testing dataset. ‘MSC’ refers to the fine-tuning dataset collected from MSC documents, and ‘Aug’ denotes the augmented fine-tuning dataset.
OCR Model | Strategy | Time/Epoch | CER (%) | WER (%) | Avg. Lev.
Tesseract [30] | pre-trained | - | 13.14 | 32.34 | 1.2166
EasyOCR [27] | pre-trained | - | 10.43 | 30.09 | 0.6125
KerasOCR [19] | pre-trained | - | 6.65 | 21.35 | 0.3724
TrOCR (small) [18] | pre-trained | - | 5.54 | 20.89 | 0.3041
TrOCR (large) [18] | pre-trained | - | 3.52 | 12.47 | 0.1931
KerasOCR w MSC [17] | fine-tuned | 8.42 | 3.17 | 11.21 | 0.1494
KerasOCR w Aug [17] | fine-tuned | 23.82 | 2.55 | 7.9 | 0.1419
TrOCR (small) w MSC | fine-tuned | 348.52 | 1.71 | 7.3 | 0.0939
TrOCR (small) w Aug | fine-tuned | 575.4 | 1.65 | 6.18 | 0.0907
TrOCR (large) w MSC | fine-tuned | 645.36 | 1.30 | 4.37 | 0.0715
TrOCR (large) w Aug | fine-tuned | 1311.2 | 0.09 | 3.94 | 0.005
Table 2. Results from the different models on the AirCorps testing dataset. ‘AirCorps’ refers to the fine-tuning dataset collected from AirCorps repository documents, and ‘Aug’ denotes the augmented fine-tuning dataset. We only report the ‘fine-tuned’ models’ results.
OCR Model | Time/Epoch | CER (%) | WER (%) | Avg. Lev.
KerasOCR w AirCorps [17] | 4.36 | 8.2 | 27.63 | 0.4477
KerasOCR w Aug [17] | 7.80 | 7.9 | 28.85 | 0.4328
TrOCR (small) w AirCorps | 37.24 | 8.24 | 24.87 | 0.4477
TrOCR (small) w Aug | 63.81 | 8.24 | 25.37 | 0.4477
TrOCR (large) w AirCorps | 155.85 | 4.57 | 15.42 | 0.2478
TrOCR (large) w Aug | 370.7 | 3.20 | 11.49 | 0.1741
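For clarity, the sketch below shows how the metrics reported in Tables 1 and 2 can be computed from a prediction/reference pair: Levenshtein edit distance, character error rate (CER), and word error rate (WER). The normalization choices here are assumptions and may differ from the exact evaluation script used in the paper.

```python
# Edit-distance-based OCR metrics: Levenshtein distance, CER, and WER.
def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two sequences.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(prediction, reference):
    # Character-level edit distance normalized by the reference length.
    return levenshtein(list(prediction), list(reference)) / max(len(reference), 1)

def wer(prediction, reference):
    # Word-level edit distance normalized by the number of reference words.
    ref_words = reference.split()
    return levenshtein(prediction.split(), ref_words) / max(len(ref_words), 1)

print(cer("DWG N0. 123", "DWG NO. 123"), wer("DWG N0. 123", "DWG NO. 123"))
```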