1 Introduction
Captchas are now the first level of standard security technology used on organizational websites [
43,
44]. Captcha systems are widely used on websites to provide security against malicious computer programs and bots. Automated bots are created specifically to scrape data from institutional websites. Compromising the captcha system has a severe effect on organizational operations because it disrupts the services offered on websites. Most of the top commercial sites are well equipped to deal with threats and vulnerabilities. However, government websites, which mostly depend on a third party for maintenance, are not always able to handle the risk. The wealth of information that can be extracted from websites is immense. Government websites are primarily used for public services and therefore hold critical information regarding the people and assets of the country. In today’s age of information technology, data are priceless, and data scraping by bots from government websites poses a significant security challenge. Captcha systems are responsible for combating this threat, and periodic analysis is necessary to evaluate the overall situation. In this article, we focus on the captcha systems used on the government and state websites of India, and we find that a few major websites are highly vulnerable. To examine the strength of a real-time security system, it is advisable to analyze the system from the viewpoint of an attacker. If an attacker can easily evade the security system, then the risk to the system is higher. The severity of the risk is inversely proportional to the effort and resources required by an attacker to break the system. An attacker would prefer to launch the attack that yields maximum profit. From an attacker’s view, two points are essential to consider before initiating an attack on a captcha system. First is the effort required to break the captcha system.
Second is the gain from compromising the captcha system, which includes information scraping and disruption in critical services of the web-server.
Text captchas in image format have been the most widely used captcha technique until now [
7]. Other captcha formats, like object recognition from multiple images and audio recognition captchas, are used on a few commercial websites. A lot of work has been done in recent times on automatically recognizing
text captchas [
4,
31,
47] in image format using deep learning and traditional machine learning algorithms. The use of
text instructions along with images incorporating the text is another type of captcha that has been used by a few important Indian government websites. The
text instructions captcha represents a domain of problems where text and image together need to be understood by a machine-learning algorithm to reach a solution. The architectures proposed in work for breaking the
text instructions captcha are useful for similar use cases (e.g., e-commerce products based on images and text) that deal with image and text simultaneously. The
text instructions are represented in various formats; for example, the instruction text reads “Evaluate the expression,” and the accompanying image shows “
\(2+4\).” The captcha solution should provide “6.”
Text instructions captchas pose a new challenge, because image recognition and understanding of the
text instructions are both required to solve the captcha. Extracting the captcha answer in text format from such
text instructions–based captcha models has not been reported yet to the best of our knowledge. The
text instructions–based captcha system becomes more complex when the
text instructions are embedded in the image. In this work, we propose neural network models for solving
text instructions–based captchas, which include images with text embedded in the images.
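To make the “Evaluate the expression” example concrete, the following minimal sketch shows only the expected input-output relation of the task: a recognized expression string is mapped to the answer string. It is not the neural architecture proposed in this work, which predicts the answer end to end. The function name `evaluate_expression` is hypothetical, and the restriction to integer addition and subtraction is an assumption made for this sketch.

```python
import ast
import operator

# Assumption for this sketch: only integer addition and subtraction are allowed.
_ALLOWED_OPS = {ast.Add: operator.add, ast.Sub: operator.sub}


def evaluate_expression(expression: str) -> str:
    """Map a recognized arithmetic string such as '2+4' to its answer as text."""
    tree = ast.parse(expression, mode="eval")

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Accept plain integers only (bool is a subclass of int, so exclude it).
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, int)
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _ALLOWED_OPS:
            return _ALLOWED_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        raise ValueError(f"unsupported expression: {expression!r}")

    return str(_eval(tree))


print(evaluate_expression("2+4"))  # prints 6
```

Walking the syntax tree instead of calling `eval` keeps the sketch safe against arbitrary input; anything other than the whitelisted operations raises `ValueError`.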
From the attacker’s perspective, extensive rule representation and feature extraction to solve the captcha system is an unattractive proposition, because it demands considerable effort, human perception, and expertise. The other factor, from the attacker’s viewpoint, is the inference time for solving the captcha. Deep neural network models are the preferred option and have attracted the attention of researchers because of their automatic feature extraction capability. This property of neural networks suits attackers with limited expertise in the field of image and text processing. The end-to-end deep neural network architecture requires minimal or no preprocessing, and features are extracted automatically using the LeCun (1989) backpropagation [
25] algorithm. A neural network architecture with a low inference time increases the effectiveness of an attack. The higher the frequency of the attack, the more effective it will be in damaging the service provided by a web-server. The frequency of attack is determined by the number of valid requests made to the server, which is only possible after solving the captcha. Therefore, a lower inference time contributes to the success of an attacker. Deep architectures typically require a huge dataset and involve high training and inference time. We focus on keeping the architecture simple (not too deep) and the number of parameters as small as possible to reduce the inference time an attacker needs to solve the captcha systems.
There are three types of captcha found on the government websites of India. The first is the most-used basic image-based text captcha, and the other two are forms of text instructions–based captchas. A detailed analysis is presented in this article for each kind of captcha to show the effort required to break the captcha systems. We have created a real-time dataset from the government websites for all three types of captcha and put forward an hour-based analysis of the time needed to break the captcha systems effectively. The smaller the dataset required, the smaller the effort needed to break the captcha. The gain, which directly reflects the value of the information available on a website, is an essential factor for attackers in choosing which websites to target. The sensitivity of the information exposed by a compromise of the captcha system and the disruption of services together represent the value of information of a site. Accordingly, to obtain a notion of the overall situation on Indian government websites, we analyze the sites and identify the risk. The captcha system for each website is rated based on its robustness and classified into different risk levels. It is important to note that the proposed architectures are tested on Indian government websites. However, their applicability is not limited to these websites. The architectures can be used to breach the captcha security system on any social websites that use text instructions–based captchas.
Contributions of the article are summarized as follows:
•
Central and state government websites with captcha systems are analyzed to assess the vulnerabilities. We consider a few critical websites containing valuable public information. We create a real-time manually annotated dataset to test the captcha system of each such website.
•
We propose novel end-to-end neural network-based architectures to solve two types of text instructions–based captchas. The architectures are simple and not too deep (to keep the inference time low), yet practical and capable of solving the text instructions captchas effectively. We also discuss a useful model to solve text captchas.
•
The study also provides an ecosystem and procedure to rate the overall risk of a captcha system used on a website. The risk is measured by how easily, with minimum effort (complexity of neural network architecture and manual annotation time), the captcha system can be compromised and by the level of gain (value of information through scraping and disruption in critical services) from the attack.
•
We assign a vulnerability rating and also reveal, through extensive experimentation, the minimum number of training data samples and the corresponding neural network hyper-parameters required to break the captcha system for each of the considered websites. In the process, we put forward an extended hour-based time analysis to break the captcha systems effectively.
2 Related Study
In this section, we review the literature related to the work on captcha systems. A large amount of research has been reported to automatically recognize
text captchas, presented in image format using traditional machine learning and deep learning algorithms [
7]. In 2003, Mori and Malik worked on breaking image-based text captchas, namely the EZ-Gimpy captcha and Gimpy captcha, used by Yahoo [
32]. They provided rules based on which a series of quick tests are performed to hypothesize locations of letters in the image. Next, strings of these hypothesized letters are extracted and, finally, the most likely words are chosen from the strings as the text of the captcha.
In 2005, Chellapilla et al. [
5,
6] worked with
Human interaction proofs (HIPs) and compared the abilities of humans and computers in recognizing single characters. Their results show that computers are as good as or better than humans at single character recognition. However, they assume that characters are segmented successfully and the approximate locations of individual HIP characters are known. In 2006, Yan and El Ahmad [
47] identified fatal flaws in the design of the image-based text captcha used by captchaservice.org. The approach has a high success rate by simply counting the number of pixels of each segmented character. However, the approach later failed when applied to more advanced captcha systems.
In 2014, Bursztein et al. [
3] used a reinforcement learning approach to segment a captcha using human feedback. They use four major components: Cut-point Detector to determine a potential way to segment, Slicer to obtain proper segmented character, Scorer for OCR, and Arbiter for character prediction. Reference [
3] came up with a single pipeline that uses machine learning to solve the segmentation and recognition problems simultaneously. Their method removes the need for any hand-crafted component, making the approach generic to new captcha schemes. The approach achieves accuracy varying from 5.33% to 55.22% for the captcha systems used by Baidu, eBay, ReCaptcha, Wikipedia, and Yahoo. Karthik et al. [
23] used
Convolutional Neural Network (CNN) for OCR on a real-time Microsoft captcha system in 2015. However, the captchas were segmented manually, and a success rate of 57.05% was achieved. Beyond static
text captchas, Xu et al. [
46] presented an analysis on usability of motion-based
text captchas. In the work, the authors focus on moving-text object recognition and, in the process, designed and implemented automated attacks on the captcha. They reported that their GPU-based implementation can decode the moving captchas faster than humans. Later, Gao et al. [
12] worked with motion-based captchas to discover their weaknesses. They proposed that as the camera projection on two-dimensional (2D) objects is constant (unlike 3D objects), it is possible to reconstruct the underlying text by superimposing and aggregating the parts of the object. Their algorithm is able to recognize moving
text captchas with an accuracy of up to 89.2%.
In 2016, Gao et al. used Gabor filters and a k-Nearest Neighbours engine to predict the characters in
text captchas [
11]. The algorithm has been tested for robustness and applied to 10 captcha schemes that include reCAPTCHA, Yahoo, Baidu, Wikipedia, and others. The success rate they achieved is between 5.0% and 77.2%, with an average accuracy of only 34.68%. Additionally, the work reported that the inference time for solving the captchas is less than 15 seconds on a standard desktop computer. In 2016, Garg and Pollett [
13] suggested a deep neural network model using CNN and
Recurrent Neural Network (RNN) to predict the character in text captcha. This work presents another model using CNN with multiple softmax. The model they proposed was the first model that required no preprocessing. However, they used a large synthetic dataset to train the proposed models.
In 2017, the team of Gao introduced an algorithm to break Microsoft’s two-layer captcha for the first time [
10]. They suggested a simple and effective segmentation for the two-layer captcha, with a CNN used to predict each character. They obtained a success rate of 44.6% and reported an attack speed of 9.05 seconds on a standard desktop computer. In 2018, Ye et al. [
48] came up with a significant study using a GAN to train a base classifier. The GAN architecture is used for learning the text captcha distribution from many text captcha schemes. A fine-tuned classifier is trained on top of the base classifier. They reported an inference time of 0.05 seconds on a standard desktop GPU while evaluating the success rate for 12 primary captcha schemes used by Sohu, eBay, Wikipedia, Microsoft, Google, and others. The accuracy range reported in the work for the base classifier is 0–83%, and for the fine-tuned classifier it is 3–92%. However, they do not use the RNN decoder approach to predict the character label; instead, they use a
fully connected (FC) layer. The limitation of using an FC layer for character prediction is that it must be adapted to handle a variable-length captcha system. The base solver has been trained using 200,000 synthetic captchas. The creation of synthetic captchas requires extensive effort, as they incorporate various security features. Developing such a robust base classifier would pose an additional challenge for an attacker. After training the base classifier, the fine-tuned classifier needs to be trained for each captcha system. The transfer learning [
33] approach is used for the fine-tuned classifier. However, it is worthwhile to note that the convolution, pooling, and the number of FC layers, which they set experimentally, are to be retrained from the beginning in the fine-tuned classifier. For an effective model, the authors retrained up to four convolution layers (of a total of five) along with an FC layer for different captcha schemes, which makes the process cumbersome and the solver model architecture complex. Their study includes only image-based
text captchas and the base classifier requires extensive effort to train. Tang et al. [
41] also carried out an extensive study, similar to that reported in Reference [
48], by combining two segmentation modules with character recognition by a CNN. Like Ye et al. [
48], the authors of the work [
41] discussed resistance mechanisms in
text captchas. They obtained attack success rates between 10.1% and 90%, with an average inference time of 0.45 seconds.
As our review shows, most of the work on captchas until now addresses
text captchas [
7]. However,
text instructions captchas are new and demand a more complex architecture. From the studies, it is clear that the neural network model has faster inference speed compared to the traditional techniques [
10,
41] and requires limited overhead and expertise. Higher inference speed is preferable from the viewpoint of an attacker. In this work, we propose simple but effective neural network pipelines to solve
text instructions–based captchas used on Indian government websites. We also discuss the effectiveness of a simple pipeline in breaking the
text captchas.
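The link between inference time and attack frequency can be illustrated with simple arithmetic. The sketch below converts the per-captcha solving times reported in the studies reviewed above into an upper bound on solved captchas per hour for a single sequential solver. It deliberately ignores network latency, server-side rate limiting, and the different hardware used in each study, so the numbers are only indicative; the function name `requests_per_hour` is hypothetical.

```python
def requests_per_hour(seconds_per_captcha: float) -> float:
    """Upper bound on captchas one sequential solver can attempt in an hour."""
    if seconds_per_captcha <= 0:
        raise ValueError("seconds_per_captcha must be positive")
    return 3600.0 / seconds_per_captcha


# Per-captcha times as reported in the reviewed studies (hardware differs).
reported_seconds = {
    # Reported as "less than 15 seconds", so this rate is a lower bound.
    "Gao et al. (2016), kNN with Gabor filters": 15.0,
    "Gao et al. (2017), segmentation plus CNN": 9.05,
    "Tang et al., segmentation plus CNN": 0.45,
    "Ye et al. (2018), GAN-based solver": 0.05,
}

for study, seconds in reported_seconds.items():
    rate = requests_per_hour(seconds)
    print(f"{study}: about {rate:.0f} captchas per hour")
```

Under these simplifying assumptions, reducing the solving time from several seconds to a fraction of a second raises the achievable request rate by roughly two orders of magnitude, which is why low inference time matters to an attacker.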
6 Results
We analyze the captcha systems of seven major Indian government websites, using the solver architecture specific to the type of captcha system each website uses. We first compare the accuracy in predicting the character sequence present in the captcha image by applying the training dataset to
Architecture I with static size and dynamic size decoder LSTMs. We observe that, on the training dataset, the accuracy of the model with the dynamic size LSTM is greater than that of the static size LSTM in every epoch, as shown in Figure
8(a). The dynamic size LSTM decoder model converges quickly compared to the static size LSTM decoder on the training dataset, as shown in Figure
8(b). Figures
8(c) and
8(d) represent the prediction accuracy with respect to the number of epochs on the development and test sets, respectively. It is clear from Figure
8 that the dynamic size LSTM yields better and more stable performance with a minimum number of epochs in comparison to the static size LSTM on the training, test, and development datasets. We demonstrate the comparison using the Bengal Bhumi website (
A.3) captcha dataset and also analyze the other website captcha datasets, as mentioned in Section
3.2, which show the same results.
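For readers reproducing such curves, the accuracy of a captcha solver can be computed as in the sketch below. It assumes that a prediction counts as correct only when the whole predicted character sequence matches the label exactly; this exact-match convention is an assumption of the sketch, since a single wrong character makes a submitted captcha fail. The function name and the sample strings are hypothetical.

```python
from typing import Sequence


def sequence_accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Percentage of captchas whose full predicted string equals the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    correct = sum(pred == label for pred, label in zip(predictions, labels))
    return 100.0 * correct / len(labels)


# Hypothetical solver outputs and ground-truth labels (case sensitive).
predicted = ["a7Kq", "x9pz", "33mL", "QwE2"]
expected = ["a7Kq", "x9pZ", "33mL", "QwE2"]
print(sequence_accuracy(predicted, expected))  # prints 75.0
```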
For each website, the model is trained multiple times by varying the size of the training dataset. In the analysis, the focus is on two aspects: One is to obtain maximum accuracy for all the datasets. The other is to approximate the size of the training dataset required to obtain a minimum accuracy of 80% on a test set. Figure
9 shows the accuracy on the development and test sets with respect to the size of the training dataset (number of data samples) for the Indian government websites using the
Type I captcha scheme. The accuracy ranges from 83.4% to 98.5%, while the working hours for manually annotating a single
Type I dataset vary from 2 hours to 6.5 hours, as mentioned in Table
2. The maximum test accuracy for each of the website captcha datasets is listed in Table
5 and shown in Figure
9. The minimum number of training data samples required to obtain 80% accuracy on each government website captcha dataset is also mentioned in Table
5.
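The procedure of approximating the training-set size needed for 80% test accuracy can be sketched as a simple search over a learning curve. The curve values below are invented for illustration and are not results from this study; the function name `min_samples_for_threshold` is likewise hypothetical.

```python
from typing import Optional, Sequence, Tuple


def min_samples_for_threshold(
    curve: Sequence[Tuple[int, float]], threshold: float = 80.0
) -> Optional[int]:
    """Smallest training-set size whose test accuracy meets the threshold.

    `curve` holds (number of training samples, test accuracy in percent)
    pairs, one per training run. Returns None if no run reaches the threshold.
    """
    qualifying = [size for size, accuracy in curve if accuracy >= threshold]
    return min(qualifying) if qualifying else None


# Hypothetical learning curve, NOT taken from the experiments in this article.
hypothetical_curve = [(250, 41.0), (500, 68.5), (1000, 82.3), (2000, 91.7)]
print(min_samples_for_threshold(hypothetical_curve))  # prints 1000
```

Because the model is retrained only at a handful of dataset sizes, the value found this way is an approximation bounded by the spacing of the sizes that were tried.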
The performance analysis of the
Type II captcha scheme used by the Indian Post website is presented in Figure
10. The dataset is tested for both
Architecture IIA and
Architecture IIB, shown in Figures
10(a) and
10(b), respectively. The maximum accuracy obtained for both model architectures is 93.52%. However, it is important to note that
Architecture IIB performs better with a minimum number of training data samples in comparison to
Architecture IIA. Better performance gain with a smaller number of training data samples in
Architecture IIB is expected, as the network learns from two fronts (output prediction label and expression text label). However, as the number of training data samples increases, the performance of the two architectures becomes relatively similar. From the viewpoint of an attacker,
Architecture IIB is more significant, as it is still effective with very few training data samples.
The performance analysis of
Type III captcha system using
Architecture III is shown in Figure
10. Figure
10(c) depicts the value of test accuracy with respect to the number of training data samples for the
Type III IRCTC (
A.7) captcha dataset. A high test accuracy of 96.07% has been achieved with the real-time dataset. However, there is a class imbalance problem in the IRCTC (
A.7) captcha dataset. To test the effectiveness of
Architecture III, we analyzed the model with a synthetic dataset of
Type III captcha. The test performance on the synthetic dataset is shown in Figure
10(d), where a maximum accuracy of more than 93% is obtained by using
Architecture III. The synthetic dataset is used only for validation of the proposed architecture and is not used for the report analysis of government websites, for obvious reasons.
The number of epochs required for convergence in each training phase depends on the captcha dataset and the corresponding architecture. For all models and datasets, we keep a few important hyper-parameters constant during training; these are listed in Table
3. The inference time required to solve a captcha using the corresponding architecture on a low-power desktop GPU is given in Table
4. It is clear from Table
4 that all the architectures have a low inference time and therefore meet the requirements of an attacker.
7 Discussions
In this section, we discuss the parameters on which the vulnerability of the websites has been tested. We focus on the human effort for annotation and the architecture complexity to determine the overall risk of the current captcha system. In the latter part of this section, we discuss the inferences we draw from the work, with a few suggestions for improvement. Finally, we justify the work in light of related state-of-the-art studies.
7.1 Analysis
We calculate the robustness of the captcha system to rate the vulnerabilities of each particular website. The robustness of the captcha system is directly proportional to the human effort required to solve the system. In the context of deep architectures, the robustness of the captcha system mainly depends on two important factors: the complexity of the architecture and the number of manually annotated captchas required to train an effective captcha solver. The complexity of an architecture is measured by the number of encoder-decoder modules used in the respective architecture, as shown in Table
2. The human effort required to annotate each captcha type differs and depends on factors like the presence of
text instructions, the maximum number of characters in the label, label type, and type of operation as per instruction. The working hours for annotating each captcha dataset are specified in Table
2. The working hours are considered for evaluation of the complexity of the captcha system. The other important aspect is the accuracy; a minimum test accuracy of 80% is taken as the threshold for an effective captcha solver model. The minimum number of annotated captchas required to obtain 80% accuracy for each model is the evaluation criterion considered in this article.
The
Robust score (
\(\gamma\) ) for a captcha system is evaluated by adding the
Architecture complexity value (
\(\alpha\) ) and the manual annotation working hours required to achieve 80% accuracy for the model (
\(\beta\) ), defined in Equation (
13). Table
5 shows the
Robust score for each dataset and the corresponding model architecture. The higher
Robust score represents a more complex and hard-to-break captcha system,
\[\gamma = \alpha + \beta, \tag{13}\]
where
\(\alpha\) is the number of encoder-decoder units in the solver architecture and
\(\beta\) is the approximate number of working hours required to manually annotate the data points needed to reach 80% accuracy.
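As a worked illustration of this definition, the sketch below computes the Robust score as the sum of the number of encoder-decoder units and the annotation hours, and then ranks a few captcha systems by it. The site names and input values are hypothetical placeholders, not the entries of Table 5, and the function name `robust_score` is an assumption of this sketch.

```python
def robust_score(encoder_decoder_units: int, annotation_hours: float) -> float:
    """Robust score: architecture complexity plus annotation working hours."""
    if encoder_decoder_units < 0 or annotation_hours < 0:
        raise ValueError("inputs must be non-negative")
    return encoder_decoder_units + annotation_hours


# Hypothetical systems: (encoder-decoder units, hours to reach 80% accuracy).
hypothetical_systems = {
    "site_a": (1, 2.0),
    "site_b": (1, 6.5),
    "site_c": (3, 5.0),
}

# A lower score means a less robust captcha system, so sort ascending.
ranked = sorted(
    hypothetical_systems,
    key=lambda name: robust_score(*hypothetical_systems[name]),
)
for name in ranked:
    print(name, robust_score(*hypothetical_systems[name]))
```

Note that the score adds a count of modules to a number of hours, so it is meaningful only for comparing systems rated with the same procedure, not as an absolute quantity.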
The risk to a website depends on the importance of that website and how vulnerable it is to attacks. The risk to a website is measured by subdividing it into two major categories: (i) intrinsic properties corresponding to the vulnerability of the captcha system and (ii) extrinsic properties relevant to launching an attack.
The first category (i.e., intrinsic properties corresponding to the vulnerability of the captcha system) is inversely proportional to the Robust score, which measures the strength of the captcha system, whereas the extrinsic properties, i.e., the non-captcha factors that attract or discourage an attacker from launching an attack on a website, are determined through the calculation of the
Website value score. The
Website value score of a government website depends on three properties: whether the website has important scraping content, is susceptible to a DoS attack, and has complex fields in the web page. Complex fields on the web page refer to text boxes whose values are hard to predict by random guessing. Equation (
14) represents the formulation of
Website value score, and the values for the corresponding websites are shown in Table
6,
where
The final Risk score for each of the considered websites is shown in Table
6 and calculated using Equation (
15),
7.2 Inferences and Suggestions
Based on the analysis, a few inferences have been derived from Tables
5 and
6. The solvers for
Type II and
Type III captchas have more complex architectures and tend to have a high
Robust score. However, the systems using
Type I captchas show high variance in the
Robust score. It has been observed that solvers for captcha systems using variable-length labels are hard to train to convergence, and these systems have a high
Robust score, as shown in Table
5. On the contrary, the fixed-character
Type I captchas are easy to train and have a low
Robust score.
From the analysis, it is clear that a well-designed variable-character Type I captcha system performs on par with a Type II captcha, while the Type III captcha is the most complex and most difficult to break. There is still some room for improving the design of Type I captchas by making changes in the images. The Type II and Type III captchas both have considerable room for improvement. The possible improvements are not limited to the images; there is also scope for improvement in the text instructions and the expression. One such option is to use a more complex mathematical expression. In the original captcha dataset of the website, the only arithmetic operations are addition and subtraction. Extended operations such as multiplication, division, or any other mathematical expression that is easy for humans to evaluate are encouraged.
The combination of natural language and image in captcha could be the future of the captcha system, and a lot of improvement is possible. But the focus of our work is to explore and escalate the vulnerabilities involved in the government websites where the current captcha systems are deployed. The proposed architectures are able to solve different types of captcha systems with high efficiency using the original captcha dataset of the websites.
7.3 Justification to State of the Art
In this work, we use neural network models to solve text and
text instructions–based captcha. The neural network models are highly effective in solving
text captchas, as reported in the previous studies [
41,
48]. The neural network models are easy to implement because of their black-box nature and could be used by an attacker with minimum effort and domain knowledge. The other significant advantage of using the neural network models is the inference speed of neural network pipelines. The work reported by the team of Gao [
10,
11], Ye et al. [
48], and Tang et al. [
41] provides enough practical evidence that a neural network model has much faster inference speed compared to the traditional methods. Therefore, the neural network pipeline is equally effective in solving text and
text instructions captchas, which we analyzed and presented in this article. From the viewpoint of an attacker, the inference speed plays a pivotal role when a web-server is highly vulnerable to frequent attacks. In the study, we observe that the
IRCTC website (
A.7) that provides critical service during
Tatkal time could be highly affected by fast attacks. Along with critical service disruption, data extraction is faster with a high-speed inference model. An attacker plans to scrape data from a website by filling text fields in the web page with well-defined random values and solving the required captcha. Therefore, to scrape data from a large government website, neural network models with low inference time are favored. In this article, we develop and propose neural network models for their effectiveness and inference speed, both of which contribute to solving the captchas.
The model used for
Type I captcha is similar to the work proposed by Garg and Pollett [
13] and Ye et al. [
48]. Both studies have a similar basic CNN structure known as LeNet [
26]. Garg and Pollett considered a completely synthetic grey captcha image dataset to train the model. The synthetic dataset contains 1 million simple images of fixed length 5, 2 million complex images of fixed length 5, and 13 million images of variable length. However, it is not feasible to create such a large dataset from a real-time captcha system for each website. Though the architecture is simple, they overestimate the required size of the dataset. Our analysis shows that the architecture can also be effective using only a few thousand data points. Our model is a modified version of the one proposed by Garg and Pollett, consisting of three convolution layers instead of two. Their CNN model did not use batch normalization, a technique that reduces the need for a huge number of training samples and training time. In this article, we demonstrate that the fine-tuned architecture can be adequately trained from scratch to break a captcha system with a limited amount of training data. From the analysis, we show that the model is equally effective with color captcha images. Another work, proposed by Ye et al. [
48] (explained in Section
2) for
Type I captcha, which uses a GAN architecture, has a major bottleneck from the viewpoint of an attacker. The attacker needs to put an extensive amount of effort into designing the base classifier by considering various captcha schemes. Moreover, an attacker has to effectively implement the complex fine-tuning process of captcha layers for each new captcha system. The number of captcha layers that need to be retrained adds another hyper-parameter to the task. The model also uses a static FC layer for prediction of captcha labels in place of a dynamic decoder, which makes the architecture less adaptable to a new captcha scheme with variable length. In this article, the model considered for the analysis of
Type I text captcha is simple, effective, and easy to implement. The model architecture can be implemented for any captcha system from scratch and independently from any other captcha scheme, unlike the approach used by Ye et al. [
48], which is dependent on various other captcha schemes used for training of the base solver.
The
text instructions–type captchas are relatively new and less explored. To the best of our knowledge, there is no solution architecture proposed to date for
text instructions captcha. A novel architecture pipeline with a faster inference process has been proposed in this article for
text instructions (
Type II and Type III) captchas. The architecture is easy to understand and effectively uses multiple encoder-decoder models to predict the correct captcha answer. It is important to note that for
Type II captcha, a statistical or rule-based system could be combined with the neural network model for correct label prediction, with the rule-based system interpreting the instructions. However, such a system fails when the number of instructions is large. The proposed models cope well with such a situation, as the architecture incorporates a language model or language understanding approach [
30,
38], which eliminates the rule-based reasoning process. The use of a neural network pipeline architecture also significantly increases the inference speed. In
Type III text instructions captcha, the
text instructions need to be extracted from the image along with the text expression. The LSTM attention module helps to predict the word, which, in turn, is used to predict the
text instructions. An attention module also plays a vital role in differentiating between the expression and the
text instructions present in the single image. Having a single pipeline is the major challenge in solving such captchas, and the model proposed in this work effectively satisfies the need, demonstrating the novelty of the work.
8 Conclusion
For the first time, we systematically analyze the captcha vulnerabilities associated with the top government websites in India. The robustness of a captcha system is tested based on the criteria of architecture complexity used in solver models and the working hours required to manually annotate the dataset. We use original captcha data points from each of the websites to test the associated model. The proposed solver models achieve a success rate of more than 80% on all the websites we consider for analysis. The results clearly indicate that a few of the captcha systems are highly vulnerable and have critical status. It is also clear from the experiments that none of the website captcha systems is unbreakable, and thus they require serious attention.
The major contribution of our work is to propose a novel neural network pipeline to solve text instructions–based captchas. The instruction-based captcha has two main variations: one with the instruction text embedded with the expression in an image, and the other with the instruction text in text format and the expression specified in the image. We propose three pipelines to solve both text instructions–based captcha schemes. We use more than one encoder-decoder deep learning module to combine the text and image information available in the captcha. A combination of CNN, LSTM, and LSTM with attention network is used together in the proposed architecture. A list of proven hyper-parameters for each of the models is provided so that the work can be re-created. It is important to note that the proposed architectures for solving text instructions–based captchas have the potential to be useful in applications that closely combine text and image.
We also come up with an ecosystem and procedure to rate the overall risk of a captcha system used on a website. Extensive results are provided for every selected website, which determine the human effort required to solve the captcha system. The captcha systems are rated based on their robustness. Each of the websites is thoroughly checked to determine the importance of the captcha system, and an Information value score is provided based on that. The Information value score and the robustness together determine the overall risk involved in the organizational websites. The alarming finding deduced from the study is how easily, with minimal working hours, a captcha solver can be built from scratch. We hope the proposed work can inspire government organizations and the research community to revisit the designs of the captcha systems used on their websites.