
1 Introduction

Handwritten Chinese text recognition (HCTR), a challenging problem in the field of pattern recognition, has long received intensive attention and seen steady progress. The difficulties of HCTR mainly arise from three aspects: a very large character set, the diversity of writing styles, and the character-touching problem.

To overcome these difficulties, researchers have proposed many solutions. Over-segmentation-based methods are very popular due to their efficiency and interpretability. They slice the input text line into sequential character segments, construct a segmentation-recognition lattice, and search for the optimal recognition path. For example, Zhou et al. [24] proposed a method based on semi-Markov conditional random fields, which combined candidate character recognition scores with geometric and linguistic contexts. Zhou et al. [25] described an alternative parameter learning method, which aimed at minimizing the character error rate rather than the string error rate. Wang et al. [16] proposed training a heterogeneous CNN with hierarchical supervision information obtained from the segmentation-recognition lattice. Wu et al. [18] investigated the effects of neural network language models and CNN shape models.

However, over-segmentation-based methods have their own limitations: their effectiveness relies heavily on the over-segmentation step. In other words, the system performs badly if the text is not well segmented. To solve this problem, segmentation-free methods have been proposed based on deep learning frameworks [9, 11, 12, 19]. Messina and Louradour [9] applied Multi-Dimensional Long Short-Term Memory (MDLSTM) with Connectionist Temporal Classification (CTC) to offline HCTR. Shi et al. [11] proposed the Convolutional Recurrent Neural Network (CRNN), which consists of convolutional layers, recurrent layers and a transcription layer, for scene text recognition. Wu et al. [19] presented a separable MDLSTM network that explores deep structures while reducing the computational effort and resources of the naive MDLSTM. As is well known, deep learning is data driven, and data preprocessing and augmentation are very important to such systems. In 2001, Vinciarelli and Luettin [14] proposed a normalization technique for cursive handwritten English words that removes the slant and slope of the words. Inspired by this work, Chen et al. [1] adopted a similar method in the data preprocessing step of their system for online HCTR. However, to the best of our knowledge, no work has systematically studied the importance of data preprocessing and augmentation for offline HCTR.

In this paper, we propose a high-performance system for offline HCTR, which includes a data preprocessing and augmentation pipeline and a CNN-ResLSTM model. To validate the proposed method, we adopt a controlled-variable methodology. First, we train a baseline CNN-ResLSTM (Long Short-Term Memory with residual connections) model using the original training samples from CASIA-HWDB2.0-2.2. Then, the data augmentation and preprocessing methods, including text sample generation, text sample preprocessing and text sample synthesis, are added to the training process step by step. Experiments show that the proposed pipeline effectively and robustly improves the performance of the system. After verifying the validity of the proposed data augmentation and preprocessing pipeline, we apply it to three models, namely the proposed CNN-ResLSTM model and the traditional CNN-LSTM and CNN-BLSTM models, and verify the superiority of our CNN-ResLSTM model. Finally, by integrating a language model, our system outperforms the previous state-of-the-art systems with a correct rate (CR) of 97.28% and an accurate rate (AR) of 96.97% on CASIA-HWDB, and a CR of 96.99% and an AR of 96.72% on the ICDAR-2013 handwriting competition dataset. Furthermore, we accelerate and compress the CNN-ResLSTM model with Tucker decomposition [7], singular value decomposition (SVD) and adaptive drop weights (ADW) [20, 22] to make it more practical for real-time applications. On the ICDAR-2013 dataset, the accelerated and compressed model requires only 146 ms per text line image on average on an i7-8700K CPU with a single thread and 2.8 MB of storage, yet maintains nearly the same recognition accuracy as the full model.

This paper is organized as follows: Sect. 2 introduces the data preprocessing and augmentation pipeline in detail; Sect. 3 describes the whole end-to-end recognition system; Sect. 4 presents experimental results; and Sect. 5 concludes the paper.

Fig. 1. Data preprocessing and augmentation pipeline.


2 Data Preprocessing and Augmentation Pipeline

In this section, we introduce our data preprocessing and augmentation pipeline, including text sample generation, text sample preprocessing and text sample synthesis, following the order of the processing steps at the training phase, as well as the preprocessing of testing samples.

2.1 Text Sample Generation

Training text sample generation is specially designed for the CASIA-HWDB2.0-2.2 dataset with the following operations. First, we only choose a part (576 pixels in length in this paper) of each original text image for training, because shorter training images speed up the training of each mini-batch several times without sacrificing any recognition performance. For testing, the image length is increased to 3,000 to contain the full text lines in the test sets. Second, we shuffle the characters within the training samples to avoid overfitting, because the test set of CASIA-HWDB2.0-2.2 shares the same corpus with the training set (although written by different writers).

Thanks to the careful and detailed annotation of the CASIA-HWDB2.0-2.2 database, document images are segmented not only at the text line level, but also at the character level, with bounding box information available. As shown by the red arrow in Fig. 1, we first randomly select a character from a text sample and then extend to the right until a length of 576 is reached. If the end of the original text line is reached while the generated text line is still far shorter than 576, we continue to extend to the left. After that, we crop the corresponding area from the original image and perform the character shuffle operation with respect to the bounding box positions. A typical randomly generated training sample is shown in Fig. 1(a).
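To make the procedure concrete, the following is a minimal Python sketch of this generation step under the above description. The helpers `width_of` (total pixel width spanned by a run of character boxes) and `render` (pasting the character patches onto a blank canvas) are hypothetical placeholders for routine image operations.

```python
# A minimal sketch of training-sample generation; `chars` is a list of
# (bounding_box, patch) pairs for one annotated text line, left to right.
# width_of() and render() are hypothetical helpers, not from the paper.
import random

def generate_sample(chars, target_len=576):
    start = random.randrange(len(chars))
    left = right = start
    # extend to the right until the crop reaches 576 pixels
    while right + 1 < len(chars) and width_of(chars[left:right + 2]) <= target_len:
        right += 1
    # if the line ended early, continue extending to the left
    while left > 0 and width_of(chars[left - 1:right + 1]) <= target_len:
        left -= 1
    selected = chars[left:right + 1]
    random.shuffle(selected)  # character shuffle against corpus overfitting
    return render(selected, target_len)
```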

2.2 Text Sample Preprocessing

Offline handwritten text lines suffer from the diversity of writing styles. In this paper, we treat an offline handwritten text line image as a collection of pixels. Denote the pixel collection of a randomly generated training sample as follows:

$$\begin{aligned} \mathbf {I} = \{p_i|p_i=((x_{i}, y_{i}), \ v_{i}),\ i=1,\cdots ,N\}, \end{aligned}$$
(1)

where \((x_{i}, y_{i})\) denotes the (x, y) coordinate, \(v_{i}\) represents the pixel value, and N is the product of the image height and width. Therefore, the pixel collection of the foreground characters in the text image can be represented as:

$$\begin{aligned} \mathbf {C} = \{c_j|c_j = ((x_{j}, y_{j}), \ v_{j}), v_{j} \ne 255, c_j \in \mathbf {I} \}, \end{aligned}$$
(2)

where we neglect the background pixels with value 255.

Now, given the coordinates of the character pixels, we can apply a de-slope operation to the generated training samples with the linear function \(x_j \times \hat{k} + \hat{b}\), which is derived through the linear least-squares curve fitting approach as follows:

$$\begin{aligned} \hat{k}, \hat{b}=\mathop {\text {arg min}}\limits _{k, b}\sum _{j}\Vert x_j \times k + b - y_j \Vert \end{aligned}$$
(3)

Then, the de-slope operation is accomplished by transforming the Y-coordinates of the character pixels as follows:

$$\begin{aligned} y'_{j} = (y_{j} - (x_{j} \times \hat{k} + \hat{b})) + \mathcal {H}, \ \ c_j = ((x_{j}, y_{j}),\ v_j) \in \mathbf {C}, \end{aligned}$$
(4)

where \(\mathcal {H}\) is chosen as half the height of the normalized image, e.g. 64 in this paper. As shown in Fig. 1(b), the de-sloped text sample can now be represented as follows:

$$\begin{aligned} \mathbf {C'} = \{c_j'|c_j'=((x_{j}, y'_{j}), \ v_{j}),v_{j} \ne 255, c_j \in \mathbf {I} \} \end{aligned}$$
(5)
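A minimal NumPy sketch of Eqs. (3)-(5) follows, assuming a gray-scale image with background value 255; the rounding and the out-of-bounds guard are our own additions.

```python
# De-slope operation of Eqs. (3)-(5): fit a line to the foreground pixels
# and shift each pixel's row so the fitted line maps to height H.
import numpy as np

def deslope(img, H=64, bg=255):
    h, _ = img.shape
    ys, xs = np.nonzero(img != bg)      # foreground pixel coordinates, Eq. (2)
    k, b = np.polyfit(xs, ys, deg=1)    # least-squares line fit, Eq. (3)
    new_ys = np.round(ys - (k * xs + b) + H).astype(int)   # Eq. (4)
    out = np.full_like(img, bg)
    keep = (new_ys >= 0) & (new_ys < h) # drop pixels shifted out of bounds
    out[new_ys[keep], xs[keep]] = img[ys[keep], xs[keep]]
    return out
```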

Next, we remove the redundant blank areas in the upper and lower parts of the image, as shown in Fig. 1(c). We then briefly introduce the border lines of a text sample and their calculation, shown as the dashed blue lines in Fig. 1(c). The height of a text image is highly unstable, as it depends on the highest and lowest points of the characters; thus, we need a stable measurement of the height of all text samples for normalization. In this paper, we introduce border lines to estimate the 'height' of a text sample under the following assumptions. First, a border line is a straight line parallel to the X-axis. Second, approximately 80% of the character pixels should lie between the upper and lower border lines, i.e. about 20% of the pixels (tolerable pixels) are equally distributed outside the upper and lower border lines. Algorithm 1 details the calculation of the upper border line; the lower border line is calculated in a similar way.

Finally, we rescale the text image so that the distance between the upper and lower border lines is exactly 64, and shift the character pixels so that the border lines are vertically symmetric about the center of the image, as shown in Fig. 1(d).
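Since Algorithm 1 is not reproduced in the text, the sketch below is an assumption built only from the stated properties: roughly 10% of the character pixels lie above the upper border line and 10% below the lower one, and the image is rescaled so that the two lines are 64 pixels apart and vertically centered. OpenCV (cv2) is assumed to be available for resizing.

```python
# Quantile-based border-line estimation and height normalization (a sketch,
# not the paper's Algorithm 1).
import numpy as np
import cv2

def normalize_height(img, gap=64, out_h=126, tol=0.2, bg=255):
    ys = np.sort(np.nonzero(img != bg)[0])
    upper = ys[int(len(ys) * tol / 2)]               # upper border line
    lower = ys[int(len(ys) * (1 - tol / 2)) - 1]     # lower border line
    scale = gap / max(lower - upper, 1)
    resized = cv2.resize(img, None, fx=scale, fy=scale,
                         interpolation=cv2.INTER_AREA)
    # shift so the border lines are symmetric about the vertical center
    shift = int(round(out_h / 2 - (upper + lower) / 2 * scale))
    canvas = np.full((out_h, resized.shape[1]), bg, dtype=img.dtype)
    src, dst = max(0, -shift), max(0, shift)
    rows = min(resized.shape[0] - src, out_h - dst)
    if rows > 0:
        canvas[dst:dst + rows] = resized[src:src + rows]
    return canvas
```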

2.3 Text Sample Synthesis

We synthesize training samples using the isolated characters of CASIA-HWDB1.0-1.2 for two reasons: (1) we want to enrich the text training set so that its distribution covers more writing styles and habits of real writers; (2) the training set of CASIA-HWDB2.0-2.2 has only 2,703 character classes, which is not enough for real-world applications, so we want to expand the character classes of our system to make it more practical.

At the training phase, the input images to our model are gray-scale, with height and width fixed to 126 and 576, respectively. For consistency, we synthesize 750,000 text line images of the same size based on the isolated characters of CASIA-HWDB1.0-1.2; the details are described in Algorithm 2. The synthesized text samples are used together with the text samples generated by the aforementioned methods to create a mixed training set. It is noteworthy that we put the isolated characters on the same horizontal line when synthesizing a sample, so we do not need to apply the preprocessing steps to the synthesized text samples.
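Algorithm 2 is only referenced, not reproduced, so the following sketch follows just the stated constraints: a gray-scale 126 x 576 canvas with characters placed on one horizontal line. Here `char_images` is an assumed list of (label, patch) pairs drawn from CASIA-HWDB1.0-1.2, with patches no taller than the canvas.

```python
# A sketch of text-line synthesis from isolated characters (not the paper's
# Algorithm 2): pick random characters and paste them left to right.
import random
import numpy as np

def synthesize_line(char_images, canvas_w=576, canvas_h=126, bg=255):
    canvas = np.full((canvas_h, canvas_w), bg, dtype=np.uint8)
    x, labels = 0, []
    while True:
        label, patch = random.choice(char_images)
        h, w = patch.shape
        if x + w > canvas_w:
            break
        y = (canvas_h - h) // 2          # one common horizontal line
        region = canvas[y:y + h, x:x + w]
        canvas[y:y + h, x:x + w] = np.minimum(region, patch)
        labels.append(label)
        x += w
    return canvas, labels
```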

2.4 Testing Sample Preprocessing

At the testing phase, we evaluate our system on the datasets described in Sect. 4. Given a testing sample image, we first convert it to gray-scale with a height of 126 and a width of 3,000 (3,000 is enough to contain the test samples in the test sets), then normalize it with the text sample preprocessing operations described in Sect. 2.2.

Fig. 2. Overview of our end-to-end recognition system.

Fig. 3. One ResLSTM layer in our model. It is constructed by introducing a shortcut (i.e. a residual connection) into the naive LSTM layer.

3 End-to-end Recognition System

As shown in Fig. 2, our end-to-end recognition system consists of three components. First, the raw input text image is processed by the data preprocessing and augmentation pipeline. After that, the convolutional neural network (CNN) extracts a feature sequence from the processed image, which is fed into the ResLSTM module to generate a probability distribution over the character dictionary for each time step. Finally, the transcription layer derives a label sequence from the distributions. The detailed settings of the proposed recognition system are described as follows.

3.1 CNN-ResLSTM Model

Offline text data has both visual (optical characters) and sequential (context) properties. Since CNNs are powerful visual feature extractors, while recurrent neural networks (RNNs) are good at modeling sequences, it is natural to combine them to explore their full potential for offline HCTR. The Convolutional Recurrent Neural Network (CRNN) [11] was proposed to recognize scene text by combining a CNN and a Bidirectional LSTM (BLSTM). Residual connections, the key component of ResNet [5], accelerate convergence and increase network performance by reducing the degradation problem of deep networks, effectively turning the network into an ensemble of many shallow models [13].

Inspired by CRNN and ResNet, we propose a CNN-ResLSTM model by introducing residual connections into the RNN part of the CRNN structure. Besides, we use LSTM instead of BLSTM for efficiency considerations. As explained in [21], the ResLSTM has the advantages of faster convergence and better performance than the naive LSTM. Our model consists of 11 convolution layers and 3 ResLSTM layers, each with 512 LSTM cells. The ReLU activation function is used after each convolution layer except the last one. Batch normalization is used after the last convolution layer. Dropout with drop ratio 0.3 is applied after each ResLSTM layer. A diagram of the ResLSTM layer is shown in Fig. 3. We illustrate the forward process of the CNN-ResLSTM in detail as follows.
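A minimal PyTorch sketch of one ResLSTM layer as depicted in Fig. 3 is given below: a unidirectional LSTM whose output is added to its input through a shortcut. The sizes (512-d input, 512 cells) and the drop ratio follow the text, but the exact dropout placement is an assumption; the original system was implemented in Caffe, so this is an illustration, not the authors' code.

```python
# One ResLSTM layer: LSTM output plus identity shortcut, then dropout.
import torch.nn as nn

class ResLSTM(nn.Module):
    def __init__(self, dim=512, drop=0.3):
        super().__init__()
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.drop = nn.Dropout(drop)

    def forward(self, x):                # x: (batch, time, 512)
        y, _ = self.lstm(x)
        return self.drop(x + y)          # residual connection

res_lstm_stack = nn.Sequential(ResLSTM(), ResLSTM(), ResLSTM())
```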

Given an input gray-scale image with shape \(1 \times 126 \times W\) (\(channel \times height \times width\)), the CNN module extracts features from the processed image and outputs feature maps with shape \(512 \times 1 \times [W/16]\). These feature maps are then converted to [W/16] frames of feature vectors, with each frame containing a 512-dimensional feature vector and serving as one time step for the ResLSTM module. The ResLSTM module generates a feature vector for each frame, and a fully connected (FC) layer predicts a probability distribution, i.e. a 7,357-dimensional score vector, for each frame. The first dimension of the score vector is reserved for the blank symbol of Connectionist Temporal Classification (CTC) [4], and the remaining 7,356 dimensions correspond to our character set.
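For a training input of width W = 576, the tensor shapes evolve as follows (a comment-style trace; the values follow directly from the description above, with [W/16] = 36):

```python
# input image:     1 x 126 x 576        (channel x height x width)
# CNN output:    512 x   1 x  36        feature maps
# sequence:       36 frames x 512-d     one frame per time step
# ResLSTM out:    36 frames x 512-d
# FC + softmax:   36 frames x 7357      dim 0 = CTC blank, dims 1-7356 = chars
```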

3.2 Transcription

Since our CNN-ResLSTM model makes a per-frame prediction and the number of frames is usually larger than the length of the label sequence, we need to decode the predictions, i.e. map the predictions to their corresponding labels. Besides, we need a suitable loss function so that the model can be trained end-to-end. To solve this problem, we adopt CTC [4] as our transcription layer. CTC allows the CNN module and the ResLSTM module to be trained jointly without any prior alignment between an input image and its corresponding label sequence. As suggested in [4], since the exact positions of the labels within a certain transcription cannot be determined, CTC considers all the locations where the labels could appear, which allows the network to be trained without pre-segmented data. A detailed forward-backward algorithm that efficiently calculates the negative log-likelihood (NLL) loss between the input sequences and the target labels is described in [3].
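The following is a hedged illustration of the CTC criterion using PyTorch's built-in loss; the tensor sizes match Sect. 3.1 (36 frames, 7,357 classes, blank at index 0), but the data here is random and only demonstrates the interface, not the authors' Caffe implementation.

```python
# CTC loss over per-frame class scores; no frame-to-label alignment needed.
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)
log_probs = torch.randn(36, 8, 7357, requires_grad=True).log_softmax(2)
targets = torch.randint(1, 7357, (8, 20))          # padded label sequences
input_lengths = torch.full((8,), 36, dtype=torch.long)
target_lengths = torch.randint(5, 21, (8,))
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # gradients flow to the network that produced log_probs
```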

3.3 Decoding and Language Model

Decoding the CTC-trained CNN-ResLSTM model can be easily accomplished by so-called best path decoding, also named naive decoding [3]. To further increase the performance of our system, an explicit language model (LM) is integrated to exploit the semantic relationships between characters. By incorporating lexical constraints and prior knowledge about the language, the language model can rectify some obvious semantic errors and thus improves the recognition results. In this paper, we only consider a character trigram language model in the experiments, and a refined beam search algorithm [4] is adopted to decode our CNN-ResLSTM model.
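Best path decoding itself is simple: take the argmax class of each frame, collapse repeats, then drop blanks. A minimal sketch, assuming `probs` is a (T, num_classes) array with the CTC blank at index 0:

```python
# Best path (naive) CTC decoding.
import numpy as np

def best_path_decode(probs, blank=0):
    path = probs.argmax(axis=1)
    decoded, prev = [], blank
    for p in path:
        if p != prev and p != blank:
            decoded.append(int(p))
        prev = p
    return decoded
```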

3.4 Model Acceleration and Compression

In our previous work [20, 22], we proposed a framework that uses ADW to compress CNN and LSTM networks and SVD to accelerate the LSTM and FC layers. For CNN acceleration, we adopt the Tucker decomposition method [7]. In this paper, we combine them to accelerate and compress our CNN-ResLSTM model. We also share some of our experience in accelerating and compressing the model, which is described in Sect. 4.3.
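To illustrate the SVD part, the sketch below factorizes a weight matrix W (out x in) into two thin matrices A (out x r) and B (r x in), so that W x is approximated by the cheaper A (B x). The rank here is illustrative, not a value from the paper.

```python
# Rank-r SVD factorization of a dense weight matrix.
import numpy as np

def svd_factorize(W, rank):
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]     # absorb singular values into A
    B = Vt[:rank, :]
    return A, B                    # W @ x  ~=  A @ (B @ x)

W = np.random.randn(7357, 512)     # e.g. the FC layer of Sect. 3.1
A, B = svd_factorize(W, rank=128)
print(np.linalg.norm(W - A @ B) / np.linalg.norm(W))  # relative error
```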

4 Experiments

4.1 Dataset

The training set of CASIA-HWDB [8] was used as the training dataset. Specifically, it is divided into six subsets, in which CASIA-HWDB1.0-1.2 contain isolated characters and CASIA-HWDB2.0-2.2 contain unconstrained handwritten texts. The isolated-character training set contains 3,118,477 samples of 7,356 classes. The text training set contains 41,781 text lines of 2,703 classes. As mentioned before, we use the isolated characters to synthesize 750,000 text lines to enrich the training set. We evaluate the performance of our system on the test sets of CASIA-HWDB (denoted as D-Casia) and of Task 4 of the ICDAR-2013 Chinese handwriting recognition competition [23] (denoted as D-ICDAR), which contain 10,449 and 3,432 text lines, respectively.

Note that, in CASIA-HWDB2.0-2.2 and the competition test set, characters outside the 7,356 isolated character classes are used in neither training nor testing.

4.2 Experimental Settings

We implemented our system on the Caffe [6] deep learning framework. The CNN-ResLSTM model was trained with the CTC criterion using mini-batch stochastic gradient descent (SGD) with momentum. Momentum and weight decay were set to 0.9 and \(1 \times 10^{-4}\), respectively. The batch size was set to 32. The initial learning rate was set to 0.1 for the first 150,000 iterations, then reduced to 0.01 and 0.001 for two further runs of 50,000 iterations each. Inspired by [1], we then turned off the shuffle operation applied to the randomly generated text samples for another 50,000 iterations. In total, we trained the model for 300,000 iterations, which takes approximately 20 h to reach convergence on a single TitanX GPU.
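The schedule above, written out as a simple step function (a sketch; the original used Caffe's solver settings, and the rate for the final 50,000 no-shuffle iterations is assumed to stay at 0.001):

```python
# Step learning-rate schedule for the 300,000-iteration training run.
def learning_rate(iteration):
    if iteration < 150_000:
        return 0.1
    if iteration < 200_000:
        return 0.01
    return 0.001   # 200,000-300,000; shuffle disabled after 250,000
```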

Table 1. Effect of data processing pipeline (without LM)

4.3 Experimental Results

Effect of Data Processing Pipeline. We evaluated our CNN-ResLSTM model with the proposed data processing pipeline on both D-Casia and D-ICDAR, as shown in Table 1. The baseline refers to the CNN-ResLSTM model trained on the training set of CASIA-HWDB2.0-2.2 without the proposed data processing pipeline.

Table 2. Effects of residual connections (without LM)
Table 3. Acceleration and compression result

We started our experiments using training samples generated from CASIA-HWDB2.0-2.2, but without the character shuffle operation. We then introduced the character shuffle operation, text sample preprocessing and synthesized text samples step by step. When the model was trained only with the generated samples, the performance on D-Casia was better than the baseline but much worse on D-ICDAR. The reason may be that D-ICDAR has more cursively written samples, and the training set for the baseline may include more cursively written samples than the generated ones, since the generated samples are shorter than the original samples. After adding the character shuffle operation, the performance improved on D-ICDAR but, surprisingly, dropped on D-Casia. This is because, when using only training samples generated from CASIA-HWDB2.0-2.2, the model overfits D-Casia (which shares the same corpus with the training set of CASIA-HWDB2.0-2.2), and the shuffle operation reduces this effect. After adding text sample preprocessing, the performance improved significantly, especially on D-ICDAR. This is probably because D-ICDAR has many more cursively written samples, and our text sample preprocessing effectively normalizes the text samples to improve the recognition results. Finally, we performed experiments by randomly picking samples from the generated texts or the synthesized texts with different ratios. We obtained the best model with a ratio of 0.5, which we adopt in the following experiments.

Effects of Residual Connections. As shown in Table 2, to investigate the effect of the residual connections in the CNN-ResLSTM model, we constructed two further models, namely CNN-LSTM and CNN-BLSTM. The CNN-LSTM was constructed by simply removing the residual connections of the CNN-ResLSTM, while the CNN-BLSTM was derived by replacing the LSTM layers of the CNN-LSTM with BLSTM layers of 1,024 cells (\(2 \times 512\)). As shown in Table 2, the CNN-ResLSTM achieved the best results among the three models, which verifies the significance of the residual connections. It is interesting to note that the results of the CNN-LSTM and CNN-BLSTM are comparable, even though the BLSTM has the potential to capture contextual information from both directions.

Acceleration and Compression. To build a compact model, we first employed the low-rank expansion method to accelerate the model. A natural idea is to decompose the CNN, LSTM and FC layers at once and then fine-tune the network, but we found that the accuracy of the model decreases considerably with such a decomposition strategy. We therefore adopt a three-step strategy to decompose the model: (1) decompose the convolutional layers of the original CNN-ResLSTM model, keep the LSTM and FC layers intact, and only update the decomposed layers; this yields the model named model-1. (2) Decompose the LSTM and FC layers of the original CNN-ResLSTM model, keep the convolutional layers intact, and only update the decomposed layers; this yields the model named model-2. (3) Extract the decomposed CNN layers from model-1 and the decomposed LSTM and FC layers from model-2, and combine them. Finally, we fine-tune the combined model to obtain the final decomposed model. For the pruning part, we prune 10% of the connections of the layers that contain more than 100,000 parameters, and cluster the weights with 256 cluster centers.

As shown in Table 3, with about 1% accuracy loss, the storage of the model drops from 61 MB to 2.8 MB, which is 21.8 times smaller, and the GFLOPs are reduced from 16.57 to 4.46, a theoretical 3.7-fold acceleration. In our forward implementation, the average time of a forward pass is reduced from 318 ms to 146 ms on an i7-8700K CPU. Furthermore, the compact model can also be combined with the language model to improve performance.

Table 4. Comparison with the state-of-the-art methods

Comparison with Other Methods.

As shown in Table 4, our method achieved the best results in all measurements compared with previous state-of-the-art approaches. The approaches we compare against include over-segmentation-based approaches [16, 18] and a segmentation-free method [19] based on the CNN-RNN-CTC framework. The comparison with [19] verifies the significance of the proposed data processing pipeline and the residual LSTM, which are the main differences between our method and [19]. It is worth noting that, after model acceleration and compression, the compact model still outperforms the other methods on D-Casia and remains comparable on D-ICDAR.

5 Conclusion

In this paper, we proposed a high-performance method for offline HCTR, which consists of a data preprocessing and augmentation pipeline and a CNN-ResLSTM model. The text images are first processed by the pipeline and then fed into the CNN-ResLSTM for training and testing. Our data preprocessing and augmentation pipeline includes three steps: training text sample generation, text sample preprocessing, and text sample synthesis using isolated characters. The model consists of two parts, a CNN part and a ResLSTM part, which are jointly trained with the CTC criterion. To further improve the performance of our system, we integrated a character trigram language model to rectify some obvious semantic errors. Experiments show that the performance of our CNN-ResLSTM model improves step by step with our data preprocessing and augmentation pipeline. Compared with previous state-of-the-art approaches, our method exhibits superior performance on D-Casia and D-ICDAR. Furthermore, to make our system more practical, we employed model acceleration and compression methods to build a compact model with a small storage size and fast speed, which still outperforms the previous state-of-the-art approaches on D-Casia and achieves comparable results on D-ICDAR.