4.1 Experiment Settings
In this study, we conduct two sets of experiments to evaluate the performance of X-Phishing-Writer: (1) a quality evaluation of cross-lingual phishing e-mails and (2) an effectiveness evaluation of these e-mails within an SED setting. To facilitate these experiments, we utilize the Nazario Phishing Corpus and the Enron e-mail dataset to create a new phishing e-mail dataset. This dataset is segmented into training, validation, and testing splits, following a 3,000/1,000/1,000 distribution of English e-mails, specifically for training the PTA.
For evaluating the impact of the SED, we employ the open-source software GoPhishing as the exercise platform. Participants include students, staff, and faculty members from a university, with 830 students in the younger group, and 267 staff members alongside 585 faculty members in the older group. During a 10-day period, one phishing e-mail is sent daily to the participants’ campus Gmail accounts. We monitor metrics such as e-mail open rates and click-through rates to assess the effectiveness of the phishing simulation.
It is important to emphasize the ethical considerations of our study. The experiment is conducted anonymously, ensuring no personal information about the participants is recorded. Furthermore, we obtain explicit consent from the university before initiating the experiment, affirming our commitment to maintaining the highest standards of research integrity and participant privacy.
During training, we use AdamW as the optimizer with an initial learning rate of 2e-5 for the multilingual language models, a batch size of 8, a maximum text length of 128 tokens, and a maximum of 30 epochs. All experiments are conducted using two NVIDIA TITAN RTX GPUs.
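For concreteness, the update rule behind the AdamW optimizer can be sketched for a single parameter as follows. The learning rate matches the reported 2e-5, while the beta, epsilon, and weight-decay values are the common defaults and are assumptions, not settings reported here.

```python
import math

# Single-parameter AdamW update, illustrating the optimizer configuration above.
# lr matches the reported 2e-5; beta1/beta2/eps/weight_decay are common defaults
# and are assumptions, not values reported in this paper.
def adamw_step(p, grad, m, v, t, lr=2e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    m = beta1 * m + (1 - beta1) * grad            # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad * grad     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: applied directly to the parameter, not the gradient.
    p = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * p)
    return p, m, v

p, m, v = 1.0, 0.0, 0.0
p, m, v = adamw_step(p, grad=0.5, m=m, v=v, t=1)  # one optimization step
```

The decoupled weight-decay term is what distinguishes AdamW from plain Adam with L2 regularization.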
4.2 Data Preparation and Preprocessing
In our study, as mentioned, we leverage the Nazario phishing e-mail corpus and the Enron e-mail dataset as our foundational data sources. From the Nazario corpus, 3,000 phishing e-mails were extracted, complemented by 2,000 legitimate (non-phishing) e-mails from the Enron dataset. These collections were amalgamated and segmented into training, validation, and testing sets, consisting of 3,000, 1,000, and 1,000 e-mails respectively. English served as the principal source language throughout our experiments, while we also aimed to extend our analysis across an additional 24 target languages. Given the scarcity of phishing e-mails in languages other than English, Google Translate was employed to facilitate the translation of e-mail content. These translated versions played a crucial role not just in assessing the model’s proficiency in generating content across various languages but also in training the PTAs within a few-shot learning framework, thereby broadening their capability to recognize phishing endeavors in diverse linguistic contexts.
Our evaluation principally scrutinized the cross-lingual efficacy of X-Phishing-Writer, spanning 25 distinct languages. We procured XML dump files of the official Wikipedia dataset, dated 1 July 2023, encompassing 24 languages. Utilizing the Wiki Extractor tool alongside bespoke scripts, we curated 3,000 Wikipedia articles per language. This process entailed data cleaning and the extraction of vital elements such as titles and main text. Keywords were subsequently derived from each article using the BM25 algorithm, furnishing both training and testing material for the GLA.
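The keyword derivation step can be illustrated with a minimal Okapi BM25 scorer. This is a sketch of the standard formula over toy documents (with the usual k1 and b defaults), not the exact extraction scripts used in our pipeline.

```python
import math
from collections import Counter

# Okapi BM25 keyword scoring over a toy corpus: rank the terms of one document
# by their BM25 score against that document, keeping the top-scoring terms.
def bm25_keywords(doc_tokens, corpus, k1=1.5, b=0.75, top_n=3):
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N   # average document length
    df = Counter()                            # document frequency per term
    for d in corpus:
        df.update(set(d))
    tf = Counter(doc_tokens)                  # term frequency in this document
    dl = len(doc_tokens)
    scores = {}
    for term, f in tf.items():
        idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
        scores[term] = idf * f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl))
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]]

corpus = [
    "your account has been suspended verify now".split(),
    "invoice payment overdue click link".split(),
    "team meeting rescheduled to friday".split(),
]
keywords = bm25_keywords(corpus[0], corpus)  # top-3 keyword candidates
```

Terms that are frequent in the article but rare across the corpus receive the highest scores, which is what makes BM25 a reasonable keyword extractor here.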
4.4 Performance Evaluation Results
In this subsection, we delve into the performance differences between X-Phishing-Writer and the baseline models. Our primary evaluation focus is on the generated results in Chinese. Additionally, we conduct a human evaluation, primarily comparing the generation outcomes of X-Phishing-Writer against those of the best baseline model.
For automated evaluation, we employ lexical matching metrics (BLEU and ROUGE) as well as embedding-based evaluation metrics (BERTScore). To assess the phishing e-mail generation task, we utilize the BLEU-1 (BL) score, the ROUGE-1 (R1) and ROUGE-2 (R2) scores, the ROUGE-L (RL) score, and the BERTScore (BS), where BS incorporates the Multilingual-BERT model. These evaluation metrics aid us in the automated assessment of the quality of model-generated outputs.
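As a concrete illustration, simplified versions of BLEU-1 (clipped unigram precision, without the brevity penalty) and ROUGE-L recall (based on the longest common subsequence) can be computed as follows. These are textbook sketches, not the exact evaluation scripts used in our experiments.

```python
from collections import Counter

# Simplified BLEU-1: clipped unigram precision, omitting the brevity penalty.
def bleu1(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    overlap = Counter(cand) & Counter(ref)  # clipped unigram matches
    return sum(overlap.values()) / len(cand)

# Simplified ROUGE-L recall: longest common subsequence over reference length.
def rouge_l_recall(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    # Dynamic-programming LCS table.
    dp = [[0] * (len(ref) + 1) for _ in range(len(cand) + 1)]
    for i, c in enumerate(cand, 1):
        for j, r in enumerate(ref, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if c == r else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1] / len(ref)

ref = "please verify your account today"
cand = "please verify your password today"
print(bleu1(cand, ref))           # 0.8
print(rouge_l_recall(cand, ref))  # 0.8
```

BERTScore, by contrast, compares contextual embeddings rather than surface n-grams, which is why it is reported separately above.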
In our study, we extend beyond conventional automated evaluation metrics for generation quality by introducing a specialized evaluation framework. This framework includes the Phishing-Classifier, a pivotal tool designed to discern the authenticity of the generated content’s alignment with genuine phishing e-mails.
Classifier Training and Purpose: We develop a classifier, denoted as C, trained on a corpus of Chinese phishing e-mails alongside Wikipedia datasets, utilizing the BERT-Chinese large model. The classifier’s primary objective is to investigate the potential influence of the GLA on the content generated by the PTA. Impressively, this classifier achieves an F1 score of 99.5%, indicating its high reliability in distinguishing phishing content from non-phishing content.
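For reference, the F1 score used to report the classifier’s reliability combines precision and recall as sketched below; the confusion counts are purely illustrative, not study data.

```python
# F1 score from a confusion-matrix summary: tp = true positives,
# fp = false positives, fn = false negatives. The counts are invented
# illustrations, chosen only to land near the reported 99.5%.
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

score = f1_score(tp=995, fp=5, fn=5)  # roughly 0.995, i.e. 99.5%
```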
Domain-Accuracy (D-ACC) Metric: To specifically assess the domain relevance of model-generated content, we introduce the Domain-Accuracy (D-ACC) metric. This metric is designed to evaluate the extent to which generated content adheres to the phishing e-mail domain, distinguishing it from unrelated Wikipedia content. The application of the Phishing-Classifier in this context enables a precise measurement of the generated content’s domain accuracy, as outlined in the following equation:

\[ \text{D-ACC} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[ C(x_i) = \text{phishing} \right] \]

Here, for a given text \(x_i\) and a total dataset size of \(N\), the classifier \(C\) assesses whether each piece of model-generated content aligns with phishing e-mail characteristics or falls outside this domain, culminating in an aggregated accuracy score, D-ACC.
Through this approach, we expect to ensure that the model’s outputs accurately reflect the phishing e-mail domain, thus affirming its utility in generating contextually relevant and domain-specific content.
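The D-ACC computation can be sketched in a few lines. The classifier below is a toy stand-in for the trained BERT-Chinese classifier, and the sample texts are invented illustrations.

```python
# Minimal D-ACC: the fraction of generated texts that the classifier C
# labels as in-domain (phishing). stub_classifier is a toy rule standing in
# for the trained BERT-Chinese model; the samples are made-up examples.
def d_acc(generated_texts, classifier):
    hits = sum(1 for x in generated_texts if classifier(x) == "phishing")
    return hits / len(generated_texts)

def stub_classifier(text):
    return "phishing" if "account" in text else "other"

samples = [
    "verify your account now",      # in-domain
    "history of ancient rome",      # Wikipedia-like, out of domain
    "your account is locked",       # in-domain
    "alpine flora of switzerland",  # out of domain
]
print(d_acc(samples, stub_classifier))  # 0.5
```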
4.4.1 Results under Zero-shot, Few-shot, and Full-shot.
In this subsection, we categorize performance evaluation into three scenarios: Zero-shot, Few-shot, and Full-shot, and observe how the different models fare under each setting.
Zero-shot Performance: The X-Phishing-Writer exhibits a marked superiority over baseline models, as depicted in Figures 5(A) to 5(D). It shines in terms of BLEU and ROUGE scores, particularly demonstrating a notable lead in D-ACC score, with an impressive margin of up to 78.33 points. In contrast, mBART (Naive) shows lackluster performance in BS evaluation (Figure 5(E)), highlighting challenges in maintaining fluency and readability. These findings underscore the limitations of naive approaches in adapting to the K2T task and highlight the crucial role of sophisticated knowledge transfer mechanisms in achieving effective cross-lingual generalization, as evidenced by mBART-K2T’s suboptimal Zero-shot transfer capability. The Zero-shot scenario stresses the models’ ability to generalize without target language training data, emphasizing the significance of adept pre-training and transfer learning methodologies in NLP tasks.
Few-shot Setting: With the introduction of a modest number of target language examples, X-Phishing-Writer maintains its lead against baselines in key metrics (Figures 5(A) to 5(D)), notably in D-ACC (Figure 5(D)). This underscores its capability not only in cross-lingual tasks but also in generating quality phishing e-mails with limited target language data, validating the effectiveness of its learning approach.
Full-shot Analysis: Upon integrating the complete dataset of target language e-mails, mBART-K2T shows improvement in D-ACC (Figure 5(D)) but a decline in other metrics. This suggests a tendency toward generating more varied content, which, while indicating adaptability, may diverge from the intended phishing e-mail characteristics. Conversely, X-Phishing-Writer, with its Adapter-based transfer learning, consistently aligns closely with the target examples, producing high-quality phishing e-mails. This scenario highlights X-Phishing-Writer’s robust architecture and its ability to maintain fidelity to the original intent across diverse linguistic contexts, showcasing its practicality for cross-lingual phishing e-mail generation.
4.4.2 Effects of Varying the Few-shot Setting Size.
To observe the impact of training data volume on the performance of X-Phishing-Writer, we varied the size of the training set, introducing data in increments of 500 training instances to the model for training. This setting allowed for a nuanced comparison of X-Phishing-Writer with the mBART-K2T model across settings: Zero-shot, Few-shot, and Full-shot.
Zero-shot: Within the Zero-shot setting, X-Phishing-Writer consistently outperforms mBART-K2T, as detailed in Figures 6(A) to 6(D). Particularly noteworthy is its proficiency in the BS metric (Figure 6(E)), highlighting superior fluency and readability with scores reaching up to 60% without any target language data. Conversely, mBART-K2T struggles significantly in both BS and D-ACC metrics (Figures 6(D) to 6(F)), indicating a deficiency in generating coherent and recognizable phishing content.
Few-shot Scenario: Shifting focus to the Few-shot setting, where a limited set of target language examples is introduced, X-Phishing-Writer demonstrates a remarkable capacity to maintain high-quality output regardless of the data volume provided (Figures 6(A) to 6(D)). This contrasts with mBART-K2T, whose performance is significantly more sensitive to the amount of training data.
Full-shot Scenario: In the Full-shot context, where the model has access to an extensive set of 3,000 target language samples, mBART-K2T shows temporary improvements over X-Phishing-Writer for certain data quantities. However, this advantage does not hold consistently, with mBART-K2T’s performance experiencing a considerable decline as demonstrated in Figure 5. This suggests that, despite initial gains, mBART-K2T fails to sustain high performance levels across all sample sizes, in contrast to X-Phishing-Writer’s more stable and robust output across varying training volumes.
In conclusion, our study underscores the significant impact of employing Adapters on model performance in diverse settings, with a special focus on cross-lingual task transferability. X-Phishing-Writer, through its innovative use of Adapters, showcases exceptional capability in generating high-quality, robust cross-lingual phishing e-mails. These results provide critical insights into improving the design and functionality of NLP models, emphasizing the effectiveness of tailored transfer learning approaches.
Future avenues of research could explore further optimization of transfer learning strategies, particularly in enhancing model adaptability across various linguistic and task-specific contexts. By pushing the boundaries of current methodologies, we aim to broaden the scope of applications for NLP models, ensuring more versatile and impactful deployments in real-world scenarios.
4.4.3 Performance Comparison with mmT5-Adapted.
In this subsection, we present a performance comparison of X-Phishing-Writer against mmT5-Adapted under various experimental settings. For a fair comparison, we use mT5 as the base model for constructing X-Phishing-Writer. Figure 7 shows the performance comparison between X-Phishing-Writer (mT5) and mmT5-Adapted.
In the Zero-shot setting, X-Phishing-Writer notably surpasses mmT5-Adapted, underscoring our framework’s proficiency in managing low-resource conditions and its ability to produce high-quality text absent target language training data. This outcome emphasizes the framework’s adaptability to zero-source languages, reinforcing its suitability for tasks with limited linguistic resources.
Transitioning to the Few-shot and Full-shot settings, our evaluation focuses on generation quality and D-ACC. Both X-Phishing-Writer and mmT5-Adapted exhibit comparable capabilities in generating high-quality textual content. Nevertheless, mmT5-Adapted tends to outperform in scenarios where ample target language data is available, highlighting its efficiency in data-rich environments. However, as evidenced in Figure 7(F), X-Phishing-Writer demonstrates a significant advantage in D-ACC, showcasing its superior precision in generating domain-specific text.
In conclusion, our experimental findings validate X-Phishing-Writer’s performance in low-resource scenarios and its precision in domain-specific text generation. We believe that the performance of X-Phishing-Writer can be attributed to its design choice of utilizing shared GLA embeddings, in contrast to mmT5’s approach of employing separate adapters for different languages. This design difference significantly enhances our model’s efficiency in the zero-shot scenario, which is especially pertinent for the cross-lingual generation of phishing emails. Given the challenge of acquiring training data for phishing emails across multiple languages, the zero-shot capability emerges as a critical feature.
4.4.4 Performance on Other Language Settings.
To ascertain the efficacy of our X-Phishing-Writer, we extend our evaluation to encompass the full set of 25 languages, utilizing ROUGE-L and BERTScore metrics for a comprehensive assessment. The results of this analysis are shown in the Appendix.
Tables 12 through 14 detail the ROUGE-L scores, showcasing X-Phishing-Writer’s adeptness at generating phishing emails across a broad linguistic spectrum. Correspondingly, Tables 15 through 17 illustrate the BERTScore outcomes, further corroborating the model’s commendable performance across diverse languages.
Notwithstanding its overall success, X-Phishing-Writer encounters challenges with certain low-resource languages, notably Korean and Japanese, where its performance dips. This diminished effectiveness is posited to stem from the inherent limitations of the mBART model in processing these languages, which diverge significantly in grammar, vocabulary, and other linguistic features from languages more closely aligned with English. Such discrepancies underscore the complex nature of cross-lingual NLP and the critical importance of language-specific considerations in model development. Addressing these linguistic variations demands a focused approach to incorporating language-specific attributes into the model, aiming to elevate cross-lingual generation quality.
Our evaluation reveals that X-Phishing-Writer excels particularly in low-resource language contexts, affirming its utility in navigating the intricacies of linguistic diversity. Nonetheless, it becomes apparent that the model’s advantages are less distinct when dealing with languages linguistically similar to English. This observation underscores the nuanced challenges of cross-lingual NLP and highlights areas for future enhancement and research, particularly in refining the model’s adaptability to a wider range of language families.
4.5 Result on Simulated Social Engineering Testing
We adopt psychological principles proposed by Ferreira and Lenzini [2015] and also refer to the experimental results from Lin et al. [2019] concerning young and older subjects. According to Lin et al. [2019], young users are more susceptible to authority and scarcity issues, while older users are more influenced by authority and reciprocity issues.
We monitor two key metrics: e-mail open rate and click-through rate (CTR). The e-mail open rate represents the percentage of recipients who open the e-mail, serving as a measure of the effectiveness of the subject line. The CTR measures how many recipients clicked on hyperlinks within the e-mail content. Since the CTR indicates the percentage of recipients who clicked the e-mail, it illustrates, over time, what portion of the audience remains engaged with the e-mail content.
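Both metrics are simple ratios over the recipient pool, as sketched below. The counts are illustrative values sized like the younger group, not measurements from our experiments.

```python
# Open rate and CTR as ratios over the recipient pool. The recipient count
# mirrors the younger group (830); the open/click counts are invented
# illustrations, not the study's measured data.
def open_rate(opens, recipients):
    return opens / recipients

def click_through_rate(clicks, recipients):
    return clicks / recipients

recipients = 830
opens, clicks = 249, 83
rate_open = open_rate(opens, recipients)             # roughly 0.30
rate_click = click_through_rate(clicks, recipients)  # roughly 0.10
```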
In Table 5, we present the results of our social engineering experiments targeting different demographics. For the young group of 830 subjects, we sent e-mails involving authority-related topics, resulting in an e-mail open rate of 0.30 and a CTR of 0.10. Regarding scarcity issues, we obtained an e-mail open rate of 0.28 and a CTR of 0.05. In comparison, for the authority-related e-mails sent to 267 staff members and 585 teachers, the staff’s e-mail open rate was 0.13 and the teachers’ was 0.19, with an average of 0.16. However, concerning CTR, staff members had 0.46, while teachers had 0, resulting in an average of 0.23. For reciprocity issues, we achieved e-mail open rates of 0.05 and 0.11, averaging 0.08. The CTR was 0.19 for staff members and 0 for teachers, averaging 0.095.
It is worth noting that we observed CTRs higher than e-mail open rates in some cases. This is likely due to Gmail’s protective mechanisms against monitoring e-mail open behavior, which can prevent accurate detection of e-mail openings; hyperlink clicks, by contrast, can be monitored precisely.
From the above discussion, it is evident that employing psychological principles in social engineering experiments across different demographics yielded significant effects. Particularly, exceptional results were obtained in both control groups involving authority-related scenarios. This aligns with the findings of Lin et al. [2019], verifying the value of using generative e-mails for social engineering training. Especially in the face of possible upcoming threats, organizations should adopt our proposed generative framework to swiftly and securely establish their social engineering training programs.
Lastly, we also observed that teachers demonstrated a higher level of defense in terms of CTR. This could be attributed to their higher education, making them more sensitive to language and better at identifying potential flaws in the grammar or wording of the generated phishing e-mails. This discovery offers valuable clues for future research on strategies to educate and train individuals to counter social engineering.
4.6 Ablation Studies
In this subsection, we present the results of examining how pre-training methods and adapters affect the performance of X-Phishing-Writer. The first set of experiments focuses on evaluating the benefits of adopting the generative pre-training for zero-shot cross-lingual generation capabilities. The second set of experiments explores the advantages brought by various adapters in X-Phishing-Writer, aiming to assess the impact of omitting adapters on model performance.
4.6.1 The Impact of Task-specific Pre-training on X-Phishing-Writer.
To ascertain the benefits of task-specific pre-training on the zero-shot cross-lingual generation capabilities of our model, we conducted a series of ablation experiments under various data conditions. These experiments aimed to compare the effectiveness of task-specific pre-training against a denoising sequence-to-sequence (Seq2Seq) pre-training approach. The outcomes of these experiments offer insights into the pre-training methods that most significantly enhance phishing e-mail generation.
Zero-shot Setting: In this initial scenario, our evaluation reveals that models employing task-specific pre-training exhibit superior performance in both D-ACC and BERTScore metrics, as detailed in Table 6. This improvement underscores the pivotal role of task-specific pre-training in boosting the model’s proficiency in phishing e-mail generation across languages.
Few-shot Setting: Extending our analysis to the Few-shot setting, we observe nuanced performance differences between the two pre-training strategies. Despite task-specific pre-training slightly lagging behind in ROUGE metrics, it showcases enhanced outcomes in BLEU-1, BERTScore, and D-ACC score, as shown in Table 7. This pattern suggests that task-specific pre-training maintains its effectiveness, even when the model is exposed to a modest amount of target language data, thereby improving the model’s performance significantly.
Full-shot Setting: When the model is trained with a comprehensive dataset (Full-shot setting), task-specific pre-training continues to show an upward trend in performance across most metrics. As reported in Table 8, though it marginally trails behind the denoising Seq2Seq pre-training in D-ACC, the observed differences are believed to be within a negligible range. This outcome highlights the capacity of task-specific pre-training to substantially enhance the model’s domain-specific generation capabilities given ample training data.
In conclusion, the results from our ablation studies clearly demonstrate the efficacy of task-specific pre-training in improving the performance of X-Phishing-Writer under varied data conditions. This is evident in metrics evaluating text fluency and domain specificity. These findings reinforce the critical importance of incorporating task-specific pre-training in NLG tasks, paving the way for more nuanced and effective model training approaches.
4.6.2 The Impact of Adapters on X-Phishing-Writer.
This subsection presents the results of ablation studies conducted to assess the role of various adapters in X-Phishing-Writer’s performance. Through these experiments, we aim to elucidate the contributions of individual adapters to the model’s effectiveness in phishing e-mail generation across languages.
Zero-shot Setting: The findings, presented in Table 9, highlight the pivotal role of adapters:
— Removing Inverse Adapter: Exclusion of the Inverse Adapter significantly hampered the model’s performance, indicating its crucial role in effectively transferring knowledge from English to Chinese phishing e-mails; without it, the quality of generated Chinese phishing e-mails degraded markedly.
— Removing Generative Language Adapter: The removal of the GLA led to a notable drop in D-ACC, underscoring its importance in providing language-specific representations and thereby enhancing cross-lingual content generation capabilities. The marked performance decline post-GLA removal rendered the effects of omitting other adapters less discernible.
Few-shot and Full-shot Settings: In scenarios where target language text is available, from a limited amount up to the full set (as shown in Tables 10 and 11), we observed:
— Removing Inverse Adapter: Interestingly, model performance improved upon the removal of the Inverse Adapter in settings with sufficient target language data. This suggests that while the model benefits from direct exposure to target language information, the transformation process via the Inverse Adapter might inadvertently diminish language representation fidelity. Nonetheless, the Inverse Adapter’s presence still positively impacts the D-ACC metric, highlighting its utility in aligning the model more closely with domain-specific requirements.
— Removing Generative Language Adapter and Phishing Task Adapter: The absence of either the GLA or the PTA adversely affects the model’s D-ACC, reaffirming their critical roles in bolstering cross-lingual representation and domain-specific learning capabilities, respectively.
— Removing Task-specific Pre-training: Eliminating task-specific pre-training results in a general performance downturn, underscoring the significance of this preparatory step in enhancing the overall efficacy of the model.
These ablation studies collectively demonstrate the contributions of the Inverse Adapter, GLA, PTA, and task-specific pre-training to the X-Phishing-Writer’s performance. Notably, the adapters’ influence varies across different settings, emphasizing the need for a nuanced understanding of their roles in cross-lingual and domain-specific NLP tasks.