1 Introduction

With the development of artificial intelligence and the Internet, distinguishing humans from machines has become a prominent topic. Since Luis von Ahn et al.  [24] first proposed the concept of CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart), it has played an increasingly important role in Internet security. At present, CAPTCHA has become an almost standard protection against automated attacks and is deployed by many websites and applications.

Among the many types of CAPTCHAs, such as image-based, audio-based and puzzle-based CAPTCHAs, the text-based type is the most convenient and the most commonly used. In this paper, we focus on breaking text-based CAPTCHAs, which is important in several ways. For example, CAPTCHA breaking can verify the security of existing CAPTCHAs and thereby promote better CAPTCHA design algorithms. Furthermore, CAPTCHA recognition technology can not only push the limits of the Turing test but also inspire text recognition approaches in other fields.

The traditional CAPTCHA recognition method typically needs three steps (preprocessing, segmentation and single-character recognition) to deal with low-resolution, noisy, deformed and adhesive text-based CAPTCHAs. This process is rather complicated and its accuracy is relatively low. Inspired by the great success of Deep Convolutional Neural Networks (DCNN), most current CAPTCHA recognition methods prefer deep learning techniques. However, DCNNs with high accuracy tend to be inefficient with respect to size and speed, which greatly restricts their application on computationally limited platforms. Therefore, we aim to explore a highly efficient architecture specially designed for low memory resources.

In this paper, we propose an accurate yet light and efficient model named ALEC for CAPTCHA recognition. The proposed method is inspired by scene text recognition and crafted especially for CAPTCHAs. The overall framework consists of a ResNet-like backbone and Connectionist Temporal Classification (CTC). To overcome the drawbacks of conventional DCNNs, such as computational complexity and model storage, several optimization techniques are applied. More precisely, we adopt depthwise separable convolutions, channel reduction and group convolutions to achieve a trade-off between representation capability and computational cost. The channel shuffle operation is integrated to help information flow across feature channels and to improve accuracy. Moreover, a lightweight attention module called Convolutional Block Attention Module (CBAM) is utilized to enhance the feature extraction ability. Our main contributions are summarized as follows:

  1. We replace standard convolution with depthwise separable convolution to reduce parameters and improve recognition accuracy.

  2. We introduce Group Convolution (GConv) and channel reduction to decrease network redundancy and guarantee generalization performance.

  3. We integrate the Convolutional Block Attention Module (CBAM) into our backbone, which improves accuracy with negligible overhead.

  4. Experiments conducted on two generated CAPTCHA datasets and one real-world CAPTCHA dataset verify the effectiveness and efficiency of ALEC.

The remainder of the paper is organized as follows. In Sect. 2, we review related work on CAPTCHA recognition and efficient model design. In Sect. 3, we describe our ALEC method in detail. In Sect. 4, we present the details and results of the experiments, and in Sect. 5, we conclude the paper.

2 Related Work

CAPTCHA Recognition. Nowadays, CAPTCHA plays an important role in multimedia security systems. The idea of CAPTCHA first appeared in the paper of Moni Naor  [4], who intended to design a mechanism that is easy for a user but difficult for a program or computer to solve. In 2000, when the first commercial text-based CAPTCHA was designed by a Carnegie Mellon University (CMU) team, research on CAPTCHA breaking technology and CAPTCHA security also started. At present, the frequently used CAPTCHAs  [21] are text-based, image-based, audio-based, etc. Since the text-based CAPTCHA is simple to implement and easy for human users to pass, it is the most widely used type in various scenarios such as Hotmail, Yahoo, Gmail, QQ and so on. Some typical text-based CAPTCHAs are shown in Fig. 1.

Fig. 1. Samples of text-based CAPTCHA. (a) Gimpy randomly picks seven words from the dictionary and then renders a distorted image containing these words. (b) EZ-Gimpy contains a single word generated by various fonts and deformations. (c) The main idea of PessimalPrint is that low-quality text images are legible to human readers while still challenging optical character recognition (OCR).

In recent years, as CAPTCHA design techniques have become more robust by introducing large character sets, distortion, adhesion, overlap or broken contours, various CAPTCHA recognition frameworks have also emerged. The traditional methods usually locate a single digit or character area in an image by segmentation and then identify it. For example, Rabih et al. [15] dealt with the segmentation by a fuzzy logic algorithm based on edge corners. Yan et al.  [28] proposed an efficient segmentation method for the Microsoft CAPTCHA and recognized it with multiple classifiers. In [9], two new segmentation techniques called projection and middle-axis point separation were proposed to handle line cluttering and character warping. All in all, the systems mentioned above consist of multiple stages, are designed for a specific type of CAPTCHA, and optimize each module independently. As a result, the errors that compound between modules can be significant, and the systems tend to generalize poorly.

In the last several years, most text-based CAPTCHAs have used Crowded Characters Together (CCT), which remarkably reduces the success rate of the segmentation process. Therefore, some promising alternative approaches resort to high-level features  [27], such as those learned by deep learning [2, 3, 16]. The commonly used deep learning models in CAPTCHA recognition are CNNs, RNNs, and so forth. The authors of [4] utilized a model consisting of two convolutional layers to learn image features. A fixed number of Softmax layers were introduced to predict each character of a fixed-length CAPTCHA. For variable-length CAPTCHAs, they adopted a series of recurrent layers where the number of layers equals the maximum number of possible characters in the CAPTCHA. Jing et al. [25] designed a modified DenseNet and achieved excellent performance on the CAPTCHA datasets of the 9th China University Student Service Outsourcing Innovation and Entrepreneurship Competition. The authors of [12] chose a combination of CNNs followed by an RNN trained on synthetic data and successfully broke the real-world CAPTCHAs used at the time by Facebook and Wikipedia. In [29], an RNN was applied to CAPTCHA recognition in a way that avoids the vanishing gradient problem and keeps long context in the network. In [17], a novel decoding approach based on a multi-population genetic algorithm was proposed, and a two-dimensional RNN was used to obtain relative information from both the horizontal and vertical context.

Efficient Models. As neural networks achieve remarkable success in many visual recognition tasks, their expensive computation and intensive memory use become an obstacle to deployment on low-memory devices and in applications with strict latency requirements. Reducing their storage and computational cost has been a hot issue, and tremendous progress has been made in this area. The common approaches to model simplification are model compression and efficient model design. Model compression  [7, 10, 23, 31] compresses an original model so that the network has fewer parameters with only a small loss of accuracy. However, some accuracy reduction usually remains, and the achievable reduction of model size is limited. Efficient model design instead creates a new neural network architecture that is specifically tailored to a desired computing range or to resource-constrained environments. In recent years, the increasing need to run deep neural networks on multiple devices has encouraged the study of various efficient model designs, and many state-of-the-art models involve this idea. For example, GoogLeNet improved the utilization of the computing resources inside the network, which limits the growth of computation while increasing the depth and width of the network  [22]. MobileNet introduced a streamlined architecture with depthwise separable convolutions and global hyper-parameters to build lightweight deep neural networks  [8]. Afterward, an improvement of MobileNet called MobileNetv2 was proposed [18]. It is based on an inverted residual structure where the non-linearities are removed in the narrow layers and the shortcut connections are between the thin bottleneck layers. Another family of efficient model designs, ShuffleNet  [30] and ShuffleNetv2  [14], utilizes new operations such as group convolution and channel shuffle to improve efficiency and derives four guidelines for efficient network design.

3 Method

As shown in Fig. 2, the proposed model named ALEC is an encoder-decoder structure in which the encoder is a ResNet-like network and the decoder is implemented by CTC  [19]. In Sect. 3.1, we introduce the overall architecture of ALEC. In Sect. 3.2, we describe the details of the repeated building blocks in ALEC.

Fig. 2. Overview of the network architecture. The architecture consists of two parts: 1) encoder, which extracts feature sequence from the input image; 2) decoder, which generates final predicted sequence. Specifically, ‘n’ denotes the number of classes.

3.1 Model Architecture

A ResNet-like network is used to extract the feature representation from the input CAPTCHA image. It compresses the 2D image into a 1D feature sequence while maintaining the width. The CTC then uses the extracted features to produce the predicted sequence.

Residual Structure Model. This research utilizes the residual structure as the backbone of the CAPTCHA recognition network to extract features. As the depth of a plain network increases, accuracy gets saturated and then degrades rapidly, which is called the degradation problem. The residual structure model [6] introduces ‘shortcut connections’ to address the degradation problem. Shortcut connections are those skipping one or more layers. Skipping effectively simplifies the network by using fewer layers in the initial training stages. Since there are fewer layers to propagate through, residual learning speeds up convergence by reducing the impact of vanishing gradients. In comparison with the plain network, the residual network reformulates the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions. The ResNet structure has several compelling advantages: residual networks are easier to optimize and can gain accuracy from considerably increased depth.
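The shortcut connection described above can be illustrated with a minimal sketch. It is not the ALEC implementation: the function name `residual_block` and the toy `residual_fn` callables are hypothetical, and plain Python lists stand in for feature tensors. The sketch only shows that a block computes F(x) + x, so it falls back to the identity mapping when the stacked layers output zero.

```python
def residual_block(x, residual_fn):
    """Return F(x) + x, where residual_fn plays the role of the stacked layers F."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("the shortcut requires F(x) and x to have the same shape")
    # Element-wise addition of the residual branch and the shortcut branch.
    return [f + xi for f, xi in zip(fx, x)]


def zero_fn(x):
    """Hypothetical residual branch that has learned nothing (all zeros)."""
    return [0.0] * len(x)


def small_fn(x):
    """Hypothetical residual branch that adds a small correction to the input."""
    return [0.1 * v for v in x]


if __name__ == "__main__":
    x = [1.0, -2.0, 3.0]
    print(residual_block(x, zero_fn))   # the input passes through unchanged
    print(residual_block(x, small_fn))  # the input plus a small learned residual
```

Because the identity is always available through the shortcut, adding such blocks should not make the representation harder to fit than the shallower network, which is the intuition behind the degradation argument.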

Connectionist Temporal Classification. CAPTCHA recognition requires translating the extracted feature maps into a predicted sequence of labels. The crucial step is to transform the network outputs into a conditional probability distribution over label sequences. The network can then be used as a classifier by selecting the most probable label sequence. We adopt the conditional probability defined in the Connectionist Temporal Classification (CTC) layer [5] proposed by Graves et al. The advantage of this structure is that it does not require explicit segmentation.

The CTC output layer contains one unit per label plus one extra unit that represents the probability of observing a ‘blank’, i.e. no label. Given an input sequence, the CTC layer outputs the probabilities of all possible ways of aligning all possible label sequences. Since one label sequence can be represented by different alignments, the conditional probability of a label sequence is found by summing the probabilities of all its possible alignments. Given an input sequence x of length T, the probability of a path \(\pi \) is the product of the probabilities of its elements, where \(y_{\pi _t}^t\) is interpreted as the probability of observing label \(\pi _t\) at time t and \(L'\) is the set of possible characters plus blank.

$$\begin{aligned} p(\pi |x)=\prod _{t=1}^T y_{\pi _t}^t, \forall \pi \in L'^T \end{aligned}$$
(1)

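The conditional probability of a given labeling l is the sum of the probabilities of all the paths \(\pi \) corresponding to it, where \(B\) denotes the many-to-one mapping that merges repeated labels in a path and then removes the blanks. The output of the classifier should be the most probable labeling for the input sequence among all label sequences of length at most T: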

$$\begin{aligned} p(l|x)=\sum _{\pi \in B^{-1}(l)}p(\pi |x) \end{aligned}$$
(2)
$$\begin{aligned} h(x) = \arg \max _{l\in L^{\le T}} p(l|x). \end{aligned}$$
(3)
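Eqs. (1) to (3) can be checked on a toy example. The following is a minimal sketch, not the decoder used in ALEC: the blank symbol `"-"`, the two-letter alphabet and the probability table `y` are made up for illustration, Eq. (2) is evaluated by brute-force enumeration (practical only for tiny T), and `best_path_decode` is the common greedy approximation to Eq. (3) rather than an exact arg max.

```python
from itertools import product

BLANK = "-"  # assumed symbol for the CTC blank; any symbol outside the alphabet works


def collapse(path):
    """The map B: merge repeated symbols, then drop blanks."""
    out = []
    prev = None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return "".join(out)


def path_prob(path, y, symbols):
    """Eq. (1): product over time of the probability of the symbol on the path."""
    p = 1.0
    for t, s in enumerate(path):
        p *= y[t][symbols.index(s)]
    return p


def label_prob(label, y, symbols):
    """Eq. (2) by brute force: sum p(pi|x) over all paths that collapse to label."""
    T = len(y)
    return sum(
        path_prob(pi, y, symbols)
        for pi in product(symbols, repeat=T)
        if collapse(pi) == label
    )


def best_path_decode(y, symbols):
    """Greedy approximation to Eq. (3): arg max symbol per step, then apply B."""
    path = [symbols[max(range(len(symbols)), key=lambda k: row[k])] for row in y]
    return collapse(path)


# Toy network output: T = 3 time steps over the symbols a, b and blank.
symbols = ["a", "b", BLANK]
y = [
    [0.6, 0.1, 0.3],
    [0.5, 0.1, 0.4],
    [0.1, 0.7, 0.2],
]

if __name__ == "__main__":
    print(collapse("aa-ab"))             # repeated symbols merge, blanks separate them
    print(label_prob("ab", y, symbols))  # sums the paths aab, abb, a-b, -ab, ab-
    print(best_path_decode(y, symbols))
```

In this example five different paths collapse to the labeling "ab", which is why no explicit character segmentation is needed: the alignment is marginalized out instead of being predicted.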

One of the distinctive properties of CTC is that decoding has no parameters to learn. Our effort can therefore concentrate on the efficiency of the encoder, i.e. the feature extraction network. In the next subsection, we describe the block structure of ALEC in detail.

3.2 Blocks in the ALEC

We redesign the blocks in ALEC to accelerate prediction and decrease the model size while maintaining recognition accuracy. As shown in Fig. 3, the standard convolution is replaced by depthwise separable convolution and group convolution followed by a channel shuffle operation. The CBAM introduced in ALEC boosts performance at a slight computational cost.

Depthwise Separable Convolution (DSConv). To decrease network redundancy and improve effectiveness, depthwise separable convolutions are applied in our method to replace the standard convolutions. Depthwise separable convolution  [20] divides the standard convolution into two parts: depthwise convolution and pointwise convolution. The depthwise and pointwise convolutions are illustrated in Fig. 7, where M is the number of input channels, N is the number of output channels and \(D_k \times D_k\) is the kernel size. Each depthwise convolution filter convolves only one specific input channel. The pointwise convolutions, with a kernel size of \(1 \times 1\), combine the multi-channel output of the depthwise convolutions to create new features. Taking the \(3 \times 3\) convolution kernel used in the residual network as an example, in theory, depthwise separable convolutions can reduce the computation by a factor of nearly 9.
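The factor quoted above follows from counting weights. A standard convolution needs \(D_k \cdot D_k \cdot M \cdot N\) weights, whereas the depthwise plus pointwise pair needs \(D_k \cdot D_k \cdot M + M \cdot N\), so the ratio is \(1/N + 1/D_k^2\). The sketch below evaluates this for a 3 × 3 kernel. It assumes bias terms are ignored, and the channel counts are illustrative values, not the ALEC configuration.

```python
def standard_conv_params(m, n, dk):
    """Weights of a standard convolution: one dk x dk x m filter per output channel."""
    return dk * dk * m * n


def depthwise_separable_params(m, n, dk):
    """Weights of a depthwise convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = dk * dk * m  # one dk x dk filter per input channel
    pointwise = m * n        # 1 x 1 convolution mixing m channels into n
    return depthwise + pointwise


def reduction_factor(m, n, dk):
    """How many times fewer weights the separable version needs."""
    return standard_conv_params(m, n, dk) / depthwise_separable_params(m, n, dk)


if __name__ == "__main__":
    # Illustrative channel counts (assumed, not taken from the paper).
    for channels in (64, 128, 256, 512):
        factor = reduction_factor(channels, channels, 3)
        print(channels, round(factor, 2))
```

The factor stays below \(D_k^2 = 9\) and approaches it as the number of output channels N grows, which is why the ninefold gain is a theoretical upper bound rather than a measured speedup.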

Fig. 3. Block structure in the ALEC. The block combines depthwise convolution, group convolution, channel shuffle and CBAM.

Group Convolution (GConv). Moreover, we combine group convolution with pointwise convolution as pointwise group convolution  [30] in order to decrease network redundancy and guarantee generalization performance. GConv  [11] splits the input feature map into groups and convolves each group of features separately. We combine GConv with pointwise convolution from unit 2 to unit 4. Because the pointwise convolutions in unit 1 have few channels, GConv is not applied there. In the first pointwise convolution of each block, the number of output channels is expanded to match the number of output channels of the block. In our experiments, we divide each pointwise convolution into two groups.

There is no interaction between the groups of a GConv. With two groups, each output feature is computed from only half of the input features. The input and output channels of different groups are unrelated, which harms the feature representation and weight learning of the convolutional network. Therefore, we apply channel shuffle in each block before the attention module, which promotes feature interleaving and fusion across groups.
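The effect of grouping and shuffling can be sketched on channel indices alone. The code below is an illustration under stated assumptions, not the ALEC code: channels are represented by a flat Python list, bias terms are ignored, and the shuffle follows the reshape, transpose and flatten recipe commonly used for ShuffleNet-style channel shuffle.

```python
def pointwise_group_params(m, n, groups):
    """Weights of a 1 x 1 group convolution: each group maps m/groups to n/groups channels."""
    if m % groups or n % groups:
        raise ValueError("channel counts must be divisible by the number of groups")
    return groups * (m // groups) * (n // groups)


def channel_shuffle(channels, groups):
    """Interleave channels so that the next group convolution sees every group.

    Equivalent to viewing the list as (groups, per_group), transposing to
    (per_group, groups) and flattening again.
    """
    n = len(channels)
    if n % groups:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    return [channels[g * per_group + i] for i in range(per_group) for g in range(groups)]


if __name__ == "__main__":
    # Two groups halve the pointwise weights (illustrative channel counts).
    print(pointwise_group_params(128, 128, 1), pointwise_group_params(128, 128, 2))
    # Channels 0-3 come from group 0 and channels 4-7 from group 1.
    print(channel_shuffle(list(range(8)), 2))  # → [0, 4, 1, 5, 2, 6, 3, 7]
```

After the shuffle, each half of the channel list contains features from both original groups, so a following two-group convolution mixes information that the previous one kept apart.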

Convolutional Block Attention Module (CBAM). Most well-designed CAPTCHAs are disturbed by various kinds of noise, including spots, curves or grids. Therefore, we integrate an attention mechanism, which plays an important role in feature extraction. Attention not only tells the network where to focus but also improves the representation of the regions of interest. The authors of [26] proposed a plug-and-play module for pre-existing base CNN architectures called the Convolutional Block Attention Module.

This architecture consists of two attention modules: channel attention and spatial attention. Given an intermediate feature map, the channel attention module first produces two spatial context descriptors by aggregating the spatial information of each channel with an average pooling operation and a max-pooling operation simultaneously, yielding \(F_{avg}^c\) and \(F_{max}^c\). \(F_{avg}^c\) describes the global features and \(F_{max}^c\) gathers distinctive object features. Both descriptors are forwarded through a shared multi-layer perceptron (MLP), and the two outputs are summed and passed through a sigmoid function \(\sigma \) to produce the channel attention map.

$$\begin{aligned} M_c(F)=\sigma (\textit{MLP}(F_{avg}^c)+\textit{MLP}(F_{max}^c)) \end{aligned}$$
(4)
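Eq. (4) can be written out in a few lines. The following is a minimal sketch in plain Python, assuming a feature map stored as C nested H × W lists and a two-layer shared MLP with a ReLU hidden layer and no bias; the weights `w0` and `w1` and the toy feature map are made-up values, not trained parameters.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def shared_mlp(vec, w0, w1):
    """Two-layer MLP (ReLU hidden layer, no bias) applied to a C-dimensional descriptor."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w0]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w1]


def channel_attention(feature, w0, w1):
    """Eq. (4): sigmoid(MLP(avg-pooled descriptor) + MLP(max-pooled descriptor)).

    feature: list of C channels, each an H x W nested list.
    Returns one attention weight in (0, 1) per channel.
    """
    avg_desc, max_desc = [], []
    for channel in feature:
        flat = [v for row in channel for v in row]
        avg_desc.append(sum(flat) / len(flat))  # global average pooling
        max_desc.append(max(flat))              # global max pooling
    a = shared_mlp(avg_desc, w0, w1)
    m = shared_mlp(max_desc, w0, w1)            # same weights: the MLP is shared
    return [sigmoid(x + y) for x, y in zip(a, m)]


def apply_channel_attention(feature, weights):
    """Scale every channel by its attention weight."""
    return [[[w * v for v in row] for row in ch] for ch, w in zip(feature, weights)]


# Toy example: C = 2 channels of size 2 x 2, one hidden unit (assumed values).
feature = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 1.0]]]
w0 = [[0.5, -0.5]]    # hidden x C
w1 = [[1.0], [-1.0]]  # C x hidden

if __name__ == "__main__":
    weights = channel_attention(feature, w0, w1)
    print(weights)
    print(apply_channel_attention(feature, weights))
```

The only learned parameters are the two small MLP matrices, which is consistent with the claim that the module adds little overhead.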

Similarly to the channel attention map, the spatial attention map is obtained by first generating a 2D average-pooled feature \(F_{avg}^s\) and a 2D max-pooled feature \(F_{max}^s\) across the channel axis. Afterward, these are concatenated and convolved by a standard convolution layer to produce the spatial attention map. As shown in Fig. 4, these two complementary attention modules are arranged sequentially.

$$\begin{aligned} M_a(F)=\sigma (f^{3\times 3}([F_{avg}^s;F_{max}^s])) \end{aligned}$$
(5)
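Eq. (5) can be sketched in the same style. The code below is illustrative only: it assumes a C × H × W feature map stored as nested lists, a single 3 × 3 filter over the two pooled maps, zero padding so that the output keeps the H × W size, and made-up kernel values.

```python
import math


def spatial_attention(feature, kernel, bias=0.0):
    """Eq. (5): sigmoid of a 3 x 3 convolution over [avg-pooled map; max-pooled map].

    feature: C x H x W nested lists.
    kernel:  2 x 3 x 3 weights, kernel[0] for the average map, kernel[1] for the max map.
    Returns an H x W attention map with values in (0, 1).
    """
    C, H, W = len(feature), len(feature[0]), len(feature[0][0])
    # Pool across the channel axis at every spatial position.
    avg_map = [[sum(feature[c][i][j] for c in range(C)) / C for j in range(W)] for i in range(H)]
    max_map = [[max(feature[c][i][j] for c in range(C)) for j in range(W)] for i in range(H)]
    maps = [avg_map, max_map]

    out = []
    for i in range(H):
        row = []
        for j in range(W):
            acc = bias
            for m in range(2):
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < H and 0 <= jj < W:  # zero padding outside the map
                            acc += kernel[m][di + 1][dj + 1] * maps[m][ii][jj]
            row.append(1.0 / (1.0 + math.exp(-acc)))
        out.append(row)
    return out


# Toy example: C = 2, H = W = 2 (assumed values).
feature_map = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 4.0], [1.0, 0.0]]]
zero_kernel = [[[0.0] * 3 for _ in range(3)] for _ in range(2)]
centre_avg_kernel = [
    [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]],  # passes the average map through
    [[0.0] * 3 for _ in range(3)],                        # ignores the max map
]

if __name__ == "__main__":
    print(spatial_attention(feature_map, zero_kernel))
    print(spatial_attention(feature_map, centre_avg_kernel))
```

In the sequential arrangement described in the text, the feature map would first be scaled by the channel attention weights and the result would then be multiplied position-wise by this spatial map.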

CBAM learns channel attention and spatial attention separately. By separating the attention generation process for a 3D feature map, it incurs much less computational and parameter overhead. Moreover, it can be plugged into any convolutional block at many bottlenecks of the network.

Fig. 4. An overview of the CBAM in the framework.

4 Experiments

In this section, we conduct extensive experiments on three benchmarks to verify the efficiency and effectiveness of the proposed method. In Sect. 4.1, we introduce the training and testing datasets. In Sect. 4.2, we describe the implementation details of the experiments. Finally, detailed experimental results and a comparison of different configurations are presented in Sect. 4.3.

4.1 Datasets

We utilize two different public CAPTCHA generators and a collection of real-world CAPTCHA to build our datasets.

For each generated CAPTCHA dataset, we generate 20000 images as the training set and 2000 images each for testing and validation. Each generated CAPTCHA image consists of 4 randomly selected alphanumeric characters, including upper and lower case letters. The generated annotations are case insensitive.
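The labeling scheme for the generated datasets can be sketched as follows. This is an assumption-laden illustration rather than the authors' data pipeline: the image rendering step is omitted, the helper names are hypothetical, case insensitivity is modeled by lower-casing the annotation, and accuracy is assumed to mean an exact match of the whole string.

```python
import random
import string

# 62 symbols may be rendered on the image: upper case, lower case and digits.
ALPHABET = string.ascii_letters + string.digits


def random_captcha_text(rng, length=4):
    """Draw the text that would be rendered on one generated CAPTCHA image."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def annotation(text):
    """Case-insensitive label, modeled here (as an assumption) by lower-casing."""
    return text.lower()


def sequence_accuracy(predictions, targets):
    """Fraction of images whose whole predicted string matches, ignoring case."""
    hits = sum(p.lower() == t.lower() for p, t in zip(predictions, targets))
    return hits / len(targets)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the toy split is reproducible
    texts = [random_captcha_text(rng) for _ in range(5)]
    labels = [annotation(t) for t in texts]
    print(list(zip(texts, labels)))
    # Case folding leaves 36 distinct symbols (26 letters and 10 digits).
    print(len(set(ALPHABET.lower())))
```

Under this convention a prediction such as "aB3x" counts as correct for a target "Ab3X", while a single wrong character makes the whole sample incorrect.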

As for the real-world CAPTCHA, we collect various types of CAPTCHA from websites and randomly shuffle these images to construct a mixed training set of over 200000 samples, with 20000 images each for testing and validation. The number of samples of each CAPTCHA type is approximately equal. The annotations are case insensitive.

Hsiaoming is a CAPTCHA library that generates image-based, text-based and audio-based CAPTCHAs. We only use text-based CAPTCHAs with random colors, curves and dots as noise. The text font is set to DroidSansMono.

Skyduy is another Python-based public CAPTCHA generator with dense-point background noise and a randomly colored curve. The text font is randomly chosen from “FONT_HERSHEY_COMPLEX”, “FONT_HERSHEY_SIMPLEX” and “FONT_ITALIC”. Some samples are shown in Fig. 5.

Real-World CAPTCHA. Real-world CAPTCHAs exhibit a lot of variation in their design. To verify the robustness and effectiveness of our proposed framework, we collect 24 different types of real-world CAPTCHA via web access. These samples cover various kinds of CAPTCHA features, including hollow shapes, adhesion, distortion, variable length and interference lines. Some samples are shown in Fig. 6.

Fig. 5. Samples of Hsiaoming (a)(b)(c) and Skyduy (d)(e)(f)

4.2 Implementation Details

The network configurations used in the experiments are summarized in Fig. 8. ALEC replaces the standard convolutions of the first unit with a combination of pointwise and depthwise convolutions. For the remaining units, group convolutions are adopted along with depthwise separable convolutions. Moreover, the CBAM module is embedded in every ResNet block to introduce spatial and channel attention. The input image size is fixed at 75 \(\times \) 32. Image processing operations such as random rotation between −3\(^\circ \) and 3\(^\circ \), elastic deformation, and random changes of contrast, brightness, hue and saturation are applied for data augmentation. The CTC output layer is used to produce the predicted sequence.

The network is trained by stochastic gradient descent with the warm-restart learning rate strategy (SGDR)  [13], given in Eq. (6). We set the minimum learning rate \(\eta ^i_{min}\) to 0 and the maximum learning rate \(\eta ^i_{max}\) to 0.05. \(T_{cur}\) counts the number of epochs performed since the last restart, and \(T_i\) is a fixed constant set to 15000.

$$\begin{aligned} \eta _t=\eta ^i_{min}+\frac{1}{2}(\eta ^i_{max}-\eta ^i_{min})(1+\cos (\frac{T_{cur}}{T_i}\pi )) \end{aligned}$$
(6)
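Eq. (6) translates directly into a schedule function. The sketch below uses the values stated in the text (a minimum of 0, a maximum of 0.05 and a period of 15000); the function names are hypothetical, and the restart is modeled with a modulo because the period is described as a fixed constant.

```python
import math


def sgdr_lr(t_cur, t_i, eta_min=0.0, eta_max=0.05):
    """Eq. (6): cosine annealing between eta_max and eta_min within one period."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_i))


def lr_at(step, t_i=15000, eta_min=0.0, eta_max=0.05):
    """Learning rate with warm restarts: T_cur is reset to 0 every t_i units.

    `step` is counted in the same unit as T_cur in the text.
    """
    return sgdr_lr(step % t_i, t_i, eta_min, eta_max)


if __name__ == "__main__":
    for step in (0, 3750, 7500, 11250, 14999, 15000):
        print(step, lr_at(step))
```

The rate starts at the maximum, decays smoothly towards the minimum over one period, and jumps back to the maximum at each restart.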

Fig. 6. Samples of real-world CAPTCHA from websites

We implement the network with the TensorFlow framework  [1], and experiments are carried out on a workstation with a 2.20 GHz Intel (R) Xeon (R) CPU, 256 GB RAM and an NVIDIA Tesla V100 GPU. The batch size is set to 64, and it takes about three hours to train the CAPTCHA recognition model on each dataset.

Fig. 7. The standard convolution is replaced by two layers: depthwise convolution in (a) and pointwise convolution in (b). Pointwise group convolution in (c) combines pointwise convolution with group convolution.

Fig. 8. Network configurations of ALEC. The kernel size, pooling size, stride and channels are shown in brackets with number of layers.

4.3 Experimental Results

To demonstrate the effectiveness of ALEC and of every module it uses, we conducted several experiments on generated and real-world CAPTCHAs. The model accuracy, number of parameters and inference time are given in Tables 1 and 2.

Comparison Between Standard Convolution and DSConv. As shown in Table 1, compared with standard convolutions, depthwise separable convolutions improve the accuracy by 1.14% on Hsiaoming and by 0.94% on Skyduy, while reducing the number of parameters by about 68% and giving an actual speedup of about 2.7 times.

The Effect of GConv. Group convolution reduces the computational complexity by transforming full-channel convolutions into group-channel convolutions. The results show that models with group convolution reduce the number of parameters by about 48%, while the accuracy drops by only 0.32% on Hsiaoming and 0.55% on Skyduy and the speed remains almost unchanged.

The Effect of Channel Number. We compare ResNet-18 models with two different channel settings, namely light and standard. The channel number of the light version is only a quarter of that of the standard version. Experiments show that the light version achieves a 0.87% accuracy improvement with about 80% fewer parameters on Hsiaoming. This modification of the channel number significantly reduces the number of parameters and improves the computational efficiency. Moreover, since CAPTCHA recognition models tend to over-fit easily, lowering the capacity of the model in this way yields better generalization performance.

Combination with Attention. CAPTCHAs usually use background disturbance and other noise to increase the difficulty of recognition. To address this problem, we utilize an attention mechanism called CBAM, which can be integrated into convolutional neural networks with negligible overhead. As shown in Table 1, models with CBAM add only about 14 K parameters but improve the accuracy by 0.70% on Hsiaoming and 0.22% on Skyduy.

We also conduct experiments on the real-world CAPTCHA dataset. As shown in Table 2, the experimental results are consistent with those on generated CAPTCHAs: ALEC achieves a 0.25% accuracy improvement, which also verifies the validity of our method. With the modules mentioned above, ALEC achieves better efficiency and strong performance on both generated and real datasets. Moreover, with fewer parameters, ALEC can be deployed on more kinds of platforms, such as mobile phones and embedded devices.

Table 1. Comparison of different network structures. In the last column we report running time in milliseconds (ms) for a single core of the Intel (R) Xeon (R) CPU E5-2650 v4 @ 2.20GHz. “Standard” indicates standard ResNet-18.

Table 2. On the real-world CAPTCHA dataset, we compare the ALEC with standard ResNet-18.

5 Conclusion

In this paper, we propose an accurate, light and efficient network for CAPTCHA recognition called ALEC, which integrates depthwise separable convolution and group convolution. Moreover, effective and efficient attention modules are applied to suppress background noise and extract valid foreground context. These properties make ALEC well suited to CAPTCHA recognition. Comprehensive experiments demonstrate that ALEC achieves higher accuracy with higher speed and fewer parameters than standard convolutional networks. The significance of this research is not limited to CAPTCHA recognition, since the approach can also be applied to related research on text recognition. In the future, we will continue to study how to improve the accuracy and speed of CAPTCHA recognition.