
Paper Review: “PP-OCR: A Practical Ultra Lightweight OCR System” — Part II

In this post, I review PP-OCR, a practical ultra-lightweight OCR system that can be easily deployed on edge devices such as cameras and mobile phones. (Part II)

Anh Tuan
May 22, 2022

Original Paper: https://arxiv.org/pdf/2009.09941.pdf

Photo by Gonzalez on Unsplash

In this Part II, I will review the direction classification model and the text recognizer in the PP-OCR system. You can read Part I of my review here:

1. Direction Classification

Fig. 1: The framework of the proposed PP-OCR ([1])

Figure 1 shows the overall architecture of PP-OCR. It has three parts: text detection, detected-box rectification, and text recognition. In Part I, I reviewed the text detector and the strategies used to improve its effectiveness and efficiency (the model size of the text detector is only 1.4M). After detecting a text box, we need to transform it into a horizontal rectangle for the text recognition task; since the rectified box may be upside down, a direction classifier is used to detect and correct the text orientation. To keep this classifier effective and efficient, the authors propose four strategies:

Light Backbone: MobileNetV3 is a light model for classification tasks, so the authors adopt MobileNetV3 small x0.35 as the backbone. Experimentally, they found that accuracy does not improve when larger backbones are used (a combined sketch of the backbone and input size appears after the Input Resolution strategy below).

Fig. 2: The performance of some backbones on the ImageNet classification ([1])

Data Augmentation: The authors use BDA (Base Data Augmentation) ([2]) and RandAugment ([3]) on the training images of the direction classifier. BDA includes rotation, perspective distortion, motion blur, and Gaussian noise; a rough sketch of these transforms follows Fig. 3.

Fig.3: Example images augmented by RandAugment ([3])
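For intuition, here is a minimal sketch of what BDA-style transforms could look like, implemented with OpenCV and NumPy. The parameter ranges below are illustrative assumptions; PP-OCR's actual augmentation pipeline may differ:

```python
import cv2
import numpy as np

def bda_augment(img, rng=None):
    """BDA-style transforms: rotation, perspective distortion,
    motion blur, and Gaussian noise. Ranges are illustrative."""
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]

    # Small random rotation around the image center.
    angle = rng.uniform(-10, 10)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    img = cv2.warpAffine(img, M, (w, h), borderMode=cv2.BORDER_REPLICATE)

    # Mild perspective distortion: jitter the four corners.
    d = 0.05 * min(h, w)
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = src + rng.uniform(-d, d, size=(4, 2)).astype(np.float32)
    img = cv2.warpPerspective(img, cv2.getPerspectiveTransform(src, dst),
                              (w, h), borderMode=cv2.BORDER_REPLICATE)

    # Horizontal motion blur with a random kernel length.
    k = int(rng.integers(3, 8))
    kernel = np.zeros((k, k), dtype=np.float32)
    kernel[k // 2, :] = 1.0 / k
    img = cv2.filter2D(img, -1, kernel)

    # Additive Gaussian noise.
    noise = rng.normal(0, 5, img.shape)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)
```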

Input Resolution: Normally, accuracy improves when the input resolution is increased. In PP-OCR, the height and width of the direction classifier's input are set to 48 and 192, respectively, to improve its accuracy.
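As a minimal sketch tying together the light backbone and this input size, one might instantiate the classifier with PaddlePaddle's built-in MobileNetV3. This assumes `paddle.vision.models.mobilenet_v3_small` accepts `scale` and `num_classes` arguments, and assumes two output classes for the upright-vs-180° decision; PP-OCR's actual implementation lives in the PaddleOCR repo and may differ:

```python
import paddle
from paddle.vision.models import mobilenet_v3_small

# MobileNetV3 small x0.35: `scale` shrinks the channel widths to 35%.
# Two output classes are assumed (text upright vs. rotated 180 degrees).
model = mobilenet_v3_small(scale=0.35, num_classes=2)

# Direction-classifier input: a batch of RGB crops at 48 x 192 (H x W).
x = paddle.randn([8, 3, 48, 192])
logits = model(x)
print(logits.shape)  # [8, 2]
```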

PACT Quantization: Quantization is a technique in which model data (parameters and activations) are converted from a floating-point representation (float32) to a lower-precision one (float16 or int8). This reduces the model size and latency and increases inference speed. The authors use PACT for quantization. I wrote a review of the PACT paper; you can check it out here.

PACT replaces ReLU with an activation function that has a clipping parameter, α, which is optimized via gradient-descent-based training.
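Concretely, PACT clips the activation to [0, α] and then quantizes it linearly to k bits. Below is a minimal NumPy sketch of the forward pass only; learning α and the straight-through gradient estimation are omitted:

```python
import numpy as np

def pact_forward(x, alpha, k=8):
    """PACT forward pass: clip activations to [0, alpha], then
    quantize them linearly to k bits. alpha is a learnable scalar."""
    y = np.clip(x, 0.0, alpha)            # clipped ReLU
    scale = (2 ** k - 1) / alpha          # quantization scale
    return np.round(y * scale) / scale    # k-bit linear quantization

x = np.array([-0.5, 0.2, 1.7, 3.0])
print(pact_forward(x, alpha=2.0, k=4))   # values snapped to a 4-bit grid
```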

2. Text Recognition

In PP-OCR, the authors use CRNN ([4]) as the text recognizer.

Fig.4: CRNN architecture ([4])

Figure 4 shows the CRNN architecture. It consists of three parts: 1) convolutional layers, which extract a feature sequence from the input image; 2) recurrent layers, which predict a label distribution for each frame; and 3) a transcription layer, which translates the per-frame predictions into the final label sequence (a toy sketch of this pipeline follows). To enhance the model's ability and reduce its size, the authors propose nine strategies.
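As a rough sketch of this three-stage pipeline (with illustrative layer sizes, not PP-OCR's exact configuration), a toy CRNN in PaddlePaddle might look like this:

```python
import paddle
import paddle.nn as nn

class ToyCRNN(nn.Layer):
    """Toy CRNN: CNN -> feature sequence -> BiLSTM -> per-frame logits.
    Layer sizes are illustrative, not PP-OCR's actual configuration."""
    def __init__(self, num_classes, hidden=48):
        super().__init__()
        # 1) Convolutional layers: extract a feature map from the image.
        self.cnn = nn.Sequential(
            nn.Conv2D(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2D(32, 64, 3, stride=(2, 1), padding=1), nn.ReLU(),
        )
        # 2) Recurrent layers: model the left-to-right frame sequence.
        self.rnn = nn.LSTM(64, hidden, direction='bidirect')
        # 3) Per-frame classifier; its outputs would feed a CTC loss.
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                # x: [N, 3, 32, 320]
        f = self.cnn(x)                  # [N, 64, 8, 160]
        seq = f.mean(axis=2)             # collapse height -> [N, 64, 160]
        seq = seq.transpose([0, 2, 1])   # frame sequence [N, 160, 64]
        seq, _ = self.rnn(seq)           # [N, 160, 2*hidden]
        return self.head(seq)            # per-frame label distribution

model = ToyCRNN(num_classes=37)          # e.g. 36 characters + CTC blank
print(model(paddle.randn([2, 3, 32, 320])).shape)  # [2, 160, 37]
```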

Light Backbone: They choose MobileNetV3 as the backbone of the text recognizer. The model size increases by only 2M, while the accuracy improves noticeably.

Data Augmentation: Besides BDA, TIA (Luo et al. 2020) ([5]) is also an effective data augmentation method for text recognition.

Fig. 5: Illustration of data augmentation, TIA

Cosine Learning Rate Decay: As with text detection, cosine learning rate decay is applied when training the text recognizer; a small sketch of the schedule follows Fig. 6.

Fig 6: Cosine learning rate decay and piecewise decay ([1])
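The cosine schedule keeps the learning rate high in early training and smoothly lowers it toward zero at the end. A minimal sketch of the formula (warm-up, discussed below, is ignored here):

```python
import math

def cosine_decay(step, total_steps, lr_max):
    """Cosine learning-rate decay: starts at lr_max, ends near 0."""
    return lr_max * 0.5 * (1 + math.cos(math.pi * step / total_steps))

# Example: peak LR 0.001 over 1000 steps.
for step in (0, 250, 500, 750, 1000):
    print(step, round(cosine_decay(step, 1000, 0.001), 6))
# 0.001 -> 0.000854 -> 0.0005 -> 0.000146 -> 0.0
```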

Feature Map Resolution:

Fig. 7: Illustration of the modification of the feature map resolution ([1])

Figure 7 shows the feature map resolution of the CRNN model in PP-OCR. The authors modify the stride of the second down-sampling feature map from (2,1) to (1,1), so that less spatial information is lost to down-sampling; the sketch below shows the effect on the output shape.
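To see the effect concretely, here is a small sketch comparing the output shape of a convolution with stride (2,1) against stride (1,1) on a 32x320 recognition input (a standalone layer for illustration, not PP-OCR's actual backbone; stride is given as (height, width)):

```python
import paddle
import paddle.nn as nn

x = paddle.randn([1, 3, 32, 320])  # a typical 32x320 recognition input

# Stride (2,1): halves the height, keeps the full width.
print(nn.Conv2D(3, 8, 3, stride=(2, 1), padding=1)(x).shape)  # [1, 8, 16, 320]

# Stride (1,1): keeps both height and width.
print(nn.Conv2D(3, 8, 3, stride=(1, 1), padding=1)(x).shape)  # [1, 8, 32, 320]
```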

Regularization Parameters: To avoid overfitting, the authors add L2 regularization to the loss function (see the sketch after the quote below).

With the help of L2 regularization, the weight of the network tend to choose a smaller value, and finally the parameters in the entire network tends to 0, and the generalization performance of the model is improved accordingly ([1])
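In practice, the L2 penalty is usually attached to the optimizer as weight decay, i.e. loss = task_loss + coeff * sum(w^2). A minimal PaddlePaddle sketch, where the coefficient 1e-5 and the stand-in model are illustrative assumptions, not necessarily PP-OCR's values:

```python
import paddle

model = paddle.nn.Linear(48, 10)  # stand-in for the recognizer
optimizer = paddle.optimizer.Adam(
    learning_rate=0.001,
    parameters=model.parameters(),
    # L2 penalty on the weights; the coefficient here is assumed.
    weight_decay=paddle.regularizer.L2Decay(1e-5),
)
```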

Learning Rate Warm-up: The authors use learning rate warm-up from [6] during training. At the beginning of training, all parameters are typically random values and therefore far from the final solution, so using too large a learning rate may cause numerical instability. With the warm-up heuristic, training starts with a small learning rate that is increased to the initial learning rate once training is stable; a sketch combining warm-up with cosine decay follows.
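Combining warm-up with the cosine schedule shown earlier, a minimal sketch (the warm-up length is an illustrative assumption):

```python
import math

def lr_schedule(step, warmup_steps, total_steps, lr_max):
    """Linear warm-up to lr_max, then cosine decay to 0."""
    if step < warmup_steps:
        return lr_max * step / warmup_steps      # linear warm-up
    t = step - warmup_steps                      # steps since warm-up ended
    T = total_steps - warmup_steps
    return lr_max * 0.5 * (1 + math.cos(math.pi * t / T))

for step in (0, 50, 100, 550, 1000):
    print(step, round(lr_schedule(step, 100, 1000, 0.001), 6))
# ramps 0 -> 0.001 over 100 steps, then decays back to 0
```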

Light Head: In PP-OCR, the dimension of the sequence features is set to 48 empirically.

Pre-trained Model: The authors train on a synthesized dataset of tens of millions of samples. Their experiments show that accuracy can be significantly improved by starting from these pre-trained models.

PACT Quantization: Similarly, they use PACT to reduce the model size of the text recognizer, except that the LSTM layers are skipped (left unquantized).

3. Some image results from the paper

4. Conclusion

In this article, I reviewed direction classification and text recognition in PP-OCR, a lightweight OCR system. In the next part, I will write a post about how to set up, train, and test the model.

You can find the official source code at: https://github.com/PaddlePaddle/PaddleOCR

If you have any questions, please comment below or contact me via LinkedIn or GitHub.

If you enjoyed this, please consider supporting me. Thank you very much!

Resources:

[1] PP-OCR: https://arxiv.org/pdf/2009.09941.pdf

[2] BDA: https://arxiv.org/pdf/2003.12294.pdf

[3] RandAugment: https://arxiv.org/pdf/1909.13719.pdf

[4] CRNN: https://arxiv.org/pdf/1507.05717.pdf

[5] TIA: https://arxiv.org/abs/2003.06606

[6] Learning rate warm-up: https://arxiv.org/pdf/1812.01187.pdf
