1. Introduction
Phishing is a highly dangerous attack that targets individuals, organizations, and even nations. It involves social engineering tactics, where attackers impersonate legitimate entities to trick victims into disclosing sensitive information. According to the report released by the Anti-Phishing Working Group (APWG) [1], a staggering 1,286,208 phishing attacks were recorded in the second quarter of 2023. The report shows that 23.5% of all phishing attacks target the financial sector, making it the most attacked sector overall. Social engineering threats and attacks are the top concern for individuals and the second concern for many organizations [2]. Attackers use various deceptive techniques to gain access to sensitive information, such as login credentials, credit card details, or personal information. Social engineering is often the initial step in a cybercriminal’s attack plan, and in approximately 82% of cases, the spread of malware within a network begins with a phishing message [2]. Various communication methods are used by phishers to carry out their attacks, with the most common methods being email messages, social network messages, text messages, and phone calls.
Email phishing is a general term that refers to emails with malicious intent. A well-known example of an email phishing attack occurred in 2018 during the FIFA World Cup. Attackers targeted football fans by sending phishing emails promising recipients free tickets to Moscow, a host city of the 2018 FIFA World Cup. By tricking individuals into opening these phishing emails and clicking on embedded links, criminals successfully gained access to the personal data of unsuspecting users.
Detecting phishing emails is essential to combating this type of attack and preventing cybercrime. Many organizations focus on strengthening their email security measures using a combination of methods. One way is the implementation of subdomain controls, which involves creating a separate domain specifically dedicated to email security to better protect against email-based attacks. In addition to this, user education and analysis of the history of phishing attacks are crucial for ensuring the security of individuals and organizations.
In the literature, classical approaches for phishing detection fall into two categories: blacklists and signature-based techniques [3]. Blacklisting is the act of making a list of suspicious resources used in previous phishing attacks. New suspicious content can be checked against blacklists to determine whether it is malicious. Unfortunately, due to the short lifespan of phishing links and the rapid creation of new ones, managing blacklists becomes difficult. Additionally, a single character change in the URL is enough for the website to go unrecognized by the blacklist. On the other hand, the signature-based approach focuses on utilizing features associated with the phishing act gathered from email addresses, links, URLs, and webpages in combination with rule setting to detect phishing attacks. Features of a newly accessed source are compared with known phishing features identified from previous experiences. Although this approach is more efficient than list-based approaches and more effective in detecting zero-day attacks, it suffers from high false-positive rates.
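To make the contrast between the two classical approaches concrete, the following minimal Python sketch compares an exact-match blacklist lookup with a small rule-based feature check. The URLs, the four rules, and the threshold are hypothetical and were chosen only for illustration; they are not taken from any of the cited systems.

```python
from urllib.parse import urlparse

# Hypothetical blacklist of URLs seen in earlier phishing campaigns.
BLACKLIST = {"http://paypa1-login.example.com/verify"}


def is_blacklisted(url: str) -> bool:
    """Exact-match lookup: any change to the URL string defeats it."""
    return url in BLACKLIST


def signature_score(url: str) -> int:
    """Count how many illustrative (assumed) phishing rules the URL triggers."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    lowered = url.lower()
    rules = [
        parsed.scheme != "https",                    # link is not served over TLS
        any(ch.isdigit() for ch in host),            # digits in the host name
        host.count(".") >= 3,                        # deeply nested subdomains
        "login" in lowered or "verify" in lowered,   # credential-related wording
    ]
    return sum(rules)


def is_suspicious(url: str, threshold: int = 2) -> bool:
    """Flag a URL when it triggers at least `threshold` rules."""
    return signature_score(url) >= threshold


known = "http://paypa1-login.example.com/verify"
variant = "http://paypa2-login.example.com/verify"   # one character changed
benign = "http://mail2.example.com/login"            # assumed legitimate page
```

In this sketch, `is_blacklisted(variant)` is false even though `variant` differs from the listed URL by a single character, whereas `is_suspicious(variant)` still flags it. The same rules also flag `benign`, which illustrates how rule-based checks can produce false positives.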
Looking into the literature, we identify the following research gap. Traditional phishing detection approaches rely on human effort to analyze phishing email features, such as the sender, subject line, and contents. However, as the complexity of phishing attacks increases, these approaches are no longer adequate. Recently, approaches based on deep learning (DL) and machine learning (ML) have demonstrated the ability to overcome the limitations of traditional phishing detection methods [4]. ML algorithms can be used to train models that can detect phishing emails. Such models can learn phishing patterns and characteristics from large phishing datasets. Prior to training, important features related to phishing activity need to be first identified. This usually requires field expertise and a careful selection of essential features that result in efficient detection algorithms.
In contrast to traditional ML algorithms, DL is capable of automatically extracting important features directly from raw data. Currently, deep neural networks are being used efficiently in several domains due to their state-of-the-art performance. In cybersecurity, researchers have demonstrated the potential of DL in tackling many cybersecurity problems [5]. However, more work is required to examine the robustness of deep neural networks for detecting email phishing [6]. DL algorithms, particularly convolutional neural networks (CNNs), long short-term memory (LSTM), and gated recurrent unit (GRU) models, have shown promising results on different classification tasks, including question classification, text categorization, and sentiment analysis and classification [7], among many others.
The problem context of the current study is phishing detection in emails. The problem is framed as a binary document classification task, treating every email as one document. Our goal is to classify emails as either phishing or legitimate. Our research question is as follows: how effective are deep learning models in detecting email phishing compared to existing methods? In particular, how effective is augmenting a CNN with recurrent layers in improving phishing detection performance? To achieve our goal, this study utilizes DL architectures, including CNNs, LSTM, GRU, and their variations. We measure the success of phishing email detection models in terms of standard metrics (precision, recall, accuracy, and area under the ROC curve). In this study, first, eight one-dimensional CNN models of various depths were trained using the Spam Assassin [8] and Phishing Corpus [9] datasets. These models are collectively referred to as 1D-CNNPD models. Second, we augmented our base 1D-CNNPD model with LSTM and GRU layers (and their variations) to train four additional models with the goal of improved performance. We call our augmented models Advanced 1D-CNNPD. LSTMs and GRUs are designed for sequential data and for capturing temporal dependencies. Recent research suggests that augmenting a CNN with recurrent layers improves the phishing detection performance [10]. Deep neural networks are expected to enhance phishing email detection as they have superior abilities in terms of capturing hierarchical representations of features, considering both low-level and high-level abstractions. The performance of the twelve models for phishing detection was evaluated and compared with that of other similar models in the literature. In general, the performance of our models is comparable to state-of-the-art models. The 1D-CNNPD augmented with Bi-GRU outperformed advanced deep learning and machine learning phishing detection algorithms, achieving 100% precision, 99.68% accuracy, an F1 score of 99.66%, and a recall of 99.32%.
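As a brief illustration of how the threshold-based metrics named above relate to one another, the following Python sketch computes precision, recall, accuracy, and the F1 score from confusion-matrix counts. The function names are our own, and the block is not the evaluation code used in this study. The final lines check that the reported F1 score is consistent with the reported precision and recall.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of emails flagged as phishing that really are phishing."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of actual phishing emails that were flagged."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all emails that were classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Consistency check on the figures reported for the Bi-GRU model:
# precision = 100% and recall = 99.32% imply an F1 score of about 99.66%.
implied_f1 = f1_score(1.0, 0.9932)
implied_f1_percent = round(implied_f1 * 100, 2)
```

Here `implied_f1_percent` evaluates to 99.66, which matches the F1 score stated in the text.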
The main aim of this study was to investigate the potential of using deep learning for email phishing detection. In addition, we wished to examine the trade-off between deep learning model complexity and performance. Very deep models for natural language processing are complex and require vast resources and long training times to achieve excellent results; however, we hypothesize that such complexity is not required for phishing detection. As such, we wish to study the effect of increasing model complexity, i.e., depth, for the problem of phishing detection. The contributions of this work are as follows: (1) we assessed the effects of varying the convolutional neural network depth on model performance in the context of phishing detection, (2) we developed a lightweight model that achieves excellent results, and (3) we recommend various interesting areas for future research.
The proposed models can assist companies in providing a higher level of security against various types of email phishing attacks by detecting distinct features of such incidents and subsequently minimizing the occurrence of data breaches. Our models can be installed on the Edge Network Operation (ENO) of email. As shown in Figure 1, the framework receives a group of emails to evaluate and dispatches them to the email client.
The rest of the paper is organized as follows: In Section 2, we review the related literature on email phishing detection. Section 3 discusses the details of our 1D-CNNPD and Advanced 1D-CNNPD models. Then, we present experimental settings and results in Section 4. The impact and findings of this study are discussed in Section 5. Finally, Section 6 concludes the paper.
5. Discussion
The main goal of this study was to investigate how email phishing detection could be improved using DL-based approaches. Email phishing detection using deep learning is rapidly evolving. This stems from the potential of deep learning to overcome the limitations of traditional methods and enhance the accuracy of detection. CNN, LSTM, and GRU architectures are capable of analyzing the contents and structure of phishing emails, as demonstrated in various studies [3]. However, a major success measure for any phishing detection model is the ability to detect zero-day phishing attacks with low false-positive rates.
A total of twelve CNN-based models were trained to detect phishing emails. Our problem is formulated as a binary document classification problem, in which each email (document) must be classified as either phishing or legitimate. Our results demonstrate that 1D-CNNPD models are very robust in extracting features of resilient phishing emails without using any hand-engineered feature extraction method. Experimenting with various depths brought to our attention the widely discussed issue of whether deeper models can yield more accurate results. Our findings show that performance degrades as the number of convolutional layers increases. This could be the result of model overfitting. We note that in our experiments, some deeper models exhibit excellent performance; nevertheless, they display a higher standard deviation, suggesting that their efficacy may not be consistent across various iterations. Our observation aligns with findings in the literature, indicating that increasing model depth tends to yield an initial improvement in performance followed by degradation [53, 54].
Historically, traditional machine learning algorithms have been extensively employed to address phishing detection and continue to be in use today. Despite being categorized as shallow learners, they have proved to be effective in certain settings. This research illustrates the superiority of deep learning over traditional shallow learning in the realm of phishing detection. The intricate architectures of deep learning models allow automatic extraction of features and contribute to a more effective learning process compared to their shallow counterparts. Two well-known datasets were used to train our models: the Spam Assassin and Phishing Corpus datasets. Despite being collected between 2004 and 2007, they remain the most widely used in the literature [39]. Recently, there have been efforts to collect phishing datasets that mimic the evolution of phishing attacks. The dataset compiled by Bountakas and Xenakis [40] contains 35,511 emails. However, only 3460 are phishing emails, of which 1472 were collected between 2015 and 2020. Although the dataset was published more recently, it is severely imbalanced, and most of its phishing emails are not recent. As such, it provides no clear advantages for us over the more commonly used datasets. The lack of representative datasets in the field of phishing emails is still an ongoing concern for researchers.
Our results indicate that the 1D-CNNPD with Bi-GRU augmentations outperforms DeepAnti-PhishNet in terms of accuracy. It is crucial to highlight that DeepAnti-PhishNet exhibits a bias toward classifying emails as legitimate because it was trained on an imbalanced corpus. In comparison to THEMIS, the 1D-CNNPD with Bi-GRU combination demonstrates a higher F1 score. For highly concealed phishing emails, where the email body exhibits an extremely high similarity to legitimate emails, the attention mechanism in THEMIS assigns a higher weight to the email header than to the email body. During THEMIS training, email header information from an open-source dataset was utilized, which could be the reason for its improved accuracy.
In practice, the implementation of email phishing detection can be tailored to the specific requirements and constraints of the system. Two common deployment scenarios are at the email server or at the edge of the network. Implementing the detection component directly on the email server enables real-time analysis and the immediate identification of potential phishing emails. Alternatively, deploying it at the network’s edge provides an added layer of security, with all incoming emails passing through the edge device for analysis before being forwarded to the email server or the recipient’s mailbox.
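The edge deployment scenario can be sketched in a few lines of Python. This is a minimal illustration rather than the framework used in this study: the `Email` type, the `screen_batch` function, the 0.5 threshold, and the keyword-based placeholder scorer are all hypothetical, and in practice a trained detection model would take the place of the placeholder.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Email:
    sender: str
    subject: str
    body: str


# A classifier maps an email to an estimated phishing probability in [0, 1].
Classifier = Callable[[Email], float]


def screen_batch(
    emails: Iterable[Email],
    classifier: Classifier,
    threshold: float = 0.5,  # assumed decision threshold
) -> Tuple[List[Email], List[Email]]:
    """Split a batch into (forwarded, quarantined) before it reaches the mail server."""
    forwarded: List[Email] = []
    quarantined: List[Email] = []
    for email in emails:
        if classifier(email) >= threshold:
            quarantined.append(email)
        else:
            forwarded.append(email)
    return forwarded, quarantined


def placeholder_scorer(email: Email) -> float:
    """Stand-in for a trained model; it only looks for one hard-coded phrase."""
    text = f"{email.subject} {email.body}".lower()
    return 1.0 if "verify your account" in text else 0.0


batch = [
    Email("it@example.com", "Action required", "Please verify your account now."),
    Email("colleague@example.com", "Minutes", "Notes from today's meeting attached."),
]
forwarded, quarantined = screen_batch(batch, placeholder_scorer)
```

Passing the classifier in as a function keeps the screening logic independent of the model, so the same routine could run on an edge device or directly on the email server.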
The phishing detection technologies that are now in use have a number of practical limitations. Zero-day attacks present a serious problem as they can continuously introduce new phishing patterns that the system has never seen before. Another challenge is spear phishing, which involves highly targeted and customized attacks that do not match known phishing patterns. Social engineering techniques are becoming more common, which further complicates things by taking advantage of human vulnerabilities and making it harder for automated systems to detect malicious intent based only on technical factors. Although AI-based systems have demonstrated tremendous success in many fields, an important challenge that remains to be addressed is algorithmic bias resulting from limited datasets. In the context of phishing detection, training datasets are inherently limited, exhibiting an imbalanced class distribution. Malicious and phishing emails are typically underrepresented, and this imbalance may lead to biased predictions with potentially unwanted consequences.
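One common way to compensate for class imbalance during training is to weight each class inversely to its frequency. The following Python sketch applies the widely used "balanced" heuristic, n_samples / (n_classes * n_class_samples), to the counts reported above for the Bountakas and Xenakis dataset. It assumes that all non-phishing emails in that dataset are legitimate, and it is shown only as an illustration; class weighting was not necessarily used in this study.

```python
from typing import Dict


def balanced_class_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Weight each class by n_samples / (n_classes * n_class_samples)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Counts from the text: 35,511 emails in total, of which 3460 are phishing.
# Treating the remainder as legitimate is an assumption of this sketch.
counts = {"legitimate": 35511 - 3460, "phishing": 3460}

phishing_share = counts["phishing"] / sum(counts.values())
weights = balanced_class_weights(counts)
```

With these counts, phishing emails make up about 9.7% of the dataset, and the heuristic assigns a weight of about 5.13 to the phishing class and about 0.55 to the legitimate class, so that both classes contribute equally to a weighted loss.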
Despite all developments in deep learning, research shows that it has not been extensively studied in the context of phishing detection. One area to investigate is the effect of hyperparameter fine-tuning algorithms to ensure the robustness of deep learning architectures for phishing detection. Many optimization algorithms, such as random search, grid search, and Bayesian optimization, could be explored. Another requirement for future phishing detection models is to have representative datasets, as class-imbalanced datasets can lead to poor and biased detection performance. Although many class-imbalanced algorithms exist, obtaining reliable and precise results is still challenging. This is particularly challenging in the case of multi-label and multi-class phishing detection. Thus, future research needs to focus on adapting class-imbalanced algorithms to the ever-evolving nature of phishing patterns. Emerging technologies pose more challenges in the detection of cyberthreats, including phishing attacks. Cybercriminals are utilizing AI technologies, such as generative AI, to create content that resembles a known contact’s tone and style, making it more difficult for phishing detection solutions to recognize. Additionally, phishing attacks can become more targeted as a result of the growing amount of behavioral data being collected. Attackers can track user activity and execute customized attacks through AI automation. Finally, deep learning is sometimes thought of as a “black box”, which means that it might be difficult to analyze or comprehend the inner workings of the model and the particular features that influence its predictions. However, AI explainability is an emerging research topic and there is ongoing work to explore the explainability of deep learning methods in the context of phishing detection, a topic that is worth investigation.
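The difference between grid search and random search can be illustrated with a short Python sketch. The search space, its values, and the scoring function below are hypothetical placeholders: the scoring function is an arbitrary formula with a known maximum, not a measured validation result, and in a real experiment it would be replaced by training and validating a model for each configuration.

```python
import itertools
import random
from typing import Any, Callable, Dict, Iterable, Iterator, Sequence

Config = Dict[str, Any]
SearchSpace = Dict[str, Sequence[Any]]


def grid_search(space: SearchSpace) -> Iterator[Config]:
    """Enumerate every combination of values in the search space."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def random_search(space: SearchSpace, n_trials: int, seed: int = 0) -> Iterator[Config]:
    """Sample n_trials configurations uniformly at random (with replacement)."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        yield {k: rng.choice(list(v)) for k, v in space.items()}


def best_config(candidates: Iterable[Config], evaluate: Callable[[Config], float]) -> Config:
    """Return the candidate with the highest score."""
    return max(candidates, key=evaluate)


# Hypothetical search space; the values are illustrative only.
SPACE: SearchSpace = {
    "conv_layers": [1, 2, 3, 4],
    "filters": [64, 128],
    "learning_rate": [1e-2, 1e-3],
}


def placeholder_score(config: Config) -> float:
    """Arbitrary stand-in for a validation score; NOT an experimental result."""
    return (
        -abs(config["conv_layers"] - 2)
        - abs(config["filters"] - 128) / 128
        - abs(config["learning_rate"] - 1e-3)
    )


grid_best = best_config(grid_search(SPACE), placeholder_score)
random_trials = list(random_search(SPACE, n_trials=5))
```

Grid search evaluates all 16 combinations in this space, whereas random search evaluates only the requested number of samples, which becomes the more practical option as the number of hyperparameters grows.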
Phishing attacks have implications at various levels, including theoretical, managerial, and social. Many theoretical frameworks have been proposed to predict phishing susceptibility. Research suggests that individual characteristics like age, gender, and technological proficiency do not correlate with a person’s susceptibility to phishing. Rather, training and anti-phishing education can help control a user’s response to a phishing attack. Institutional and business IT managers should be mindful of the harm that phishing attacks can cause to organizations, including sensitive data breaches, loss of data and intellectual property, reputation damage, customer churn, and monetary losses. Protecting organizations from phishing attacks requires a combination of countermeasures, including employee education, technical solutions, and compliance with policies and relevant laws. Phishing attacks have far-reaching societal effects since they diminish society’s trust in using technology, making people less confident in conducting critical transactions online or assisting others.