4.1 Experiment Settings
In this study, we conduct two sets of experiments to evaluate the performance of X-Phishing-Writer: (1) a quality evaluation of cross-lingual phishing e-mails and (2) an effectiveness evaluation of these e-mails within an SED setting. To facilitate these experiments, we utilize the Nazario Phishing Corpus and the Enron e-mail dataset to create a new phishing e-mail dataset. This dataset is segmented into training, validation, and testing splits, following a 3,000/1,000/1,000 distribution of English e-mails, specifically for training the PTA.
For evaluating the impact of the SED, we employ the open-source software GoPhishing as the exercise platform. Participants include students, staff, and faculty members from a university, with 830 students in the younger group, and 267 staff members alongside 585 faculty members in the older group. During a 10-day period, one phishing e-mail is sent daily to the participants’ campus Gmail accounts. We monitor metrics such as e-mail open rates and click-through rates to assess the effectiveness of the phishing simulation.
It is important to emphasize the ethical considerations of our study. The experiment is conducted anonymously, ensuring no personal information about the participants is recorded. Furthermore, we obtain explicit consent from the university before initiating the experiment, affirming our commitment to maintaining the highest standards of research integrity and participant privacy.
During training, we use AdamW as the optimizer with an initial learning rate of 2e-5 for the multilingual language models, a batch size of 8, a maximum text length of 128 tokens, and a maximum of 30 epochs. All experiments are conducted using two NVIDIA TITAN RTX GPUs.
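For concreteness, the update rule behind the AdamW optimizer can be sketched for a single parameter as follows. The learning rate matches the reported 2e-5, while the beta, epsilon, and weight-decay values are the common defaults and are assumptions, not settings reported here.

```python
import math

# Single-parameter AdamW update, illustrating the optimizer configuration above.
# lr matches the reported 2e-5; beta1/beta2/eps/weight_decay are common defaults
# and are assumptions, not values reported in this paper.
def adamw_step(p, grad, m, v, t, lr=2e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    m = beta1 * m + (1 - beta1) * grad            # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad * grad     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: applied directly to the parameter, not the gradient.
    p = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * p)
    return p, m, v

p, m, v = 1.0, 0.0, 0.0
p, m, v = adamw_step(p, grad=0.5, m=m, v=v, t=1)  # one optimization step
```

The decoupled weight-decay term is what distinguishes AdamW from plain Adam with L2 regularization.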
4.2 Data Preparation and Preprocessing
In our study, as mentioned, we leverage the Nazario phishing e-mail corpus and the Enron e-mail dataset as our foundational data sources. From the Nazario corpus, 3,000 phishing e-mails were extracted, complemented by 2,000 legitimate (non-phishing) e-mails from the Enron dataset. These collections were amalgamated and segmented into training, validation, and testing sets, consisting of 3,000, 1,000, and 1,000 e-mails respectively. English served as the principal source language throughout our experiments, while we also aimed to extend our analysis across an additional 24 target languages. Given the scarcity of phishing e-mails in languages other than English, Google Translate was employed to facilitate the translation of e-mail content. These translated versions played a crucial role not just in assessing the model’s proficiency in generating content across various languages but also in training the PTAs within a few-shot learning framework, thereby broadening their capability to recognize phishing endeavors in diverse linguistic contexts.
Our evaluation principally scrutinized the cross-lingual efficacy of X-Phishing-Writer, spanning 25 distinct languages. We procured XML dump files of the official Wikipedia dataset, dated 1 July 2023, encompassing 24 languages. Utilizing the Wiki Extractor tool alongside bespoke scripts, we curated 3,000 Wikipedia articles per language. This process entailed data cleaning and the extraction of vital elements such as titles and main text. Keywords were subsequently derived from each article using the BM25 algorithm, furnishing both training and testing material for the GLA.
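The keyword derivation step can be illustrated with a minimal Okapi BM25 scorer. This is a sketch of the standard formula over toy documents (with the usual k1 and b defaults), not the exact extraction scripts used in our pipeline.

```python
import math
from collections import Counter

# Okapi BM25 keyword scoring over a toy corpus: rank the terms of one document
# by their BM25 score against that document, keeping the top-scoring terms.
def bm25_keywords(doc_tokens, corpus, k1=1.5, b=0.75, top_n=3):
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N   # average document length
    df = Counter()                            # document frequency per term
    for d in corpus:
        df.update(set(d))
    tf = Counter(doc_tokens)                  # term frequency in this document
    dl = len(doc_tokens)
    scores = {}
    for term, f in tf.items():
        idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
        scores[term] = idf * f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl))
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]]

corpus = [
    "your account has been suspended verify now".split(),
    "invoice payment overdue click link".split(),
    "team meeting rescheduled to friday".split(),
]
keywords = bm25_keywords(corpus[0], corpus)  # top-3 keyword candidates
```

Terms that are frequent in the article but rare across the corpus receive the highest scores, which is what makes BM25 a reasonable keyword extractor here.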
4.4 Performance Evaluation Results
In this subsection, we delve into the performance differences between X-Phishing-Writer and the baseline models. Our primary evaluation focus is on the generated results in Chinese. Additionally, we conduct a human evaluation, primarily comparing the generation outcomes of X-Phishing-Writer against those of the best baseline model.
For automated evaluation, we employ lexical matching metrics (BLEU and ROUGE) as well as embedding-based evaluation metrics (BERTScore). To assess the phishing e-mail generation task, we utilize the BLEU-1 (BL) score, the ROUGE-1 (R1) and ROUGE-2 (R2) scores, the ROUGE-L (RL) score, and the BERTScore (BS), where BS incorporates the Multilingual-BERT model. These evaluation metrics aid us in the automated assessment of the quality of model-generated outputs.
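As a concrete illustration, simplified versions of BLEU-1 (clipped unigram precision, without the brevity penalty) and ROUGE-L recall (based on the longest common subsequence) can be computed as follows. These are textbook sketches, not the exact evaluation scripts used in our experiments.

```python
from collections import Counter

# Simplified BLEU-1: clipped unigram precision, omitting the brevity penalty.
def bleu1(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    overlap = Counter(cand) & Counter(ref)  # clipped unigram matches
    return sum(overlap.values()) / len(cand)

# Simplified ROUGE-L recall: longest common subsequence over reference length.
def rouge_l_recall(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    # Dynamic-programming LCS table.
    dp = [[0] * (len(ref) + 1) for _ in range(len(cand) + 1)]
    for i, c in enumerate(cand, 1):
        for j, r in enumerate(ref, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if c == r else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1] / len(ref)

ref = "please verify your account today"
cand = "please verify your password today"
print(bleu1(cand, ref))           # 0.8
print(rouge_l_recall(cand, ref))  # 0.8
```

BERTScore, by contrast, compares contextual embeddings rather than surface n-grams, which is why it is reported separately above.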
In our study, we extend beyond conventional automated evaluation metrics for generation quality by introducing a specialized evaluation framework. This framework includes the Phishing-Classifier, a pivotal tool designed to discern the authenticity of the generated content’s alignment with genuine phishing e-mails.
Classifier Training and Purpose: We develop a classifier, denoted as C, trained on a corpus of Chinese phishing e-mails alongside Wikipedia datasets, utilizing the BERT-Chinese large model. The classifier’s primary objective is to investigate the potential influence of the GLA on the content generated by the PTA. Impressively, this classifier achieves an F1 score of 99.5%, indicating its high reliability in distinguishing phishing content from non-phishing content.
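For reference, the F1 score used to report the classifier’s reliability combines precision and recall as sketched below; the confusion counts are purely illustrative, not study data.

```python
# F1 score from a confusion-matrix summary: tp = true positives,
# fp = false positives, fn = false negatives. The counts are invented
# illustrations, chosen only to land near the reported 99.5%.
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

score = f1_score(tp=995, fp=5, fn=5)  # roughly 0.995, i.e. 99.5%
```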
Domain-Accuracy (D-ACC) Metric: To specifically assess the domain relevance of model-generated content, we introduce the Domain-Accuracy (D-ACC) metric. This metric is designed to evaluate the extent to which generated content adheres to the phishing e-mail domain, distinguishing it from unrelated Wikipedia content. The application of the Phishing-Classifier in this context enables a precise measurement of the generated content’s domain accuracy, as outlined in the following equation:

\[ \text{D-ACC} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[ C(x_i) = \text{phishing} \right] \]

Here, for a given text \(x_i\) and a total dataset size of \(N\), the classifier \(C\) assesses whether each piece of model-generated content aligns with phishing e-mail characteristics or falls outside this domain, culminating in an aggregated accuracy score, D-ACC.
Through this approach, we expect to ensure that the model’s outputs accurately reflect the phishing e-mail domain, thus affirming its utility in generating contextually relevant and domain-specific content.
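The D-ACC computation can be sketched in a few lines. The classifier below is a toy stand-in for the trained BERT-Chinese classifier, and the sample texts are invented illustrations.

```python
# Minimal D-ACC: the fraction of generated texts that the classifier C
# labels as in-domain (phishing). stub_classifier is a toy rule standing in
# for the trained BERT-Chinese model; the samples are made-up examples.
def d_acc(generated_texts, classifier):
    hits = sum(1 for x in generated_texts if classifier(x) == "phishing")
    return hits / len(generated_texts)

def stub_classifier(text):
    return "phishing" if "account" in text else "other"

samples = [
    "verify your account now",      # in-domain
    "history of ancient rome",      # Wikipedia-like, out of domain
    "your account is locked",       # in-domain
    "alpine flora of switzerland",  # out of domain
]
print(d_acc(samples, stub_classifier))  # 0.5
```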
4.4.1 Results under Zero-shot, Few-shot, and Full-shot.
In this subsection, we categorize performance evaluation into three scenarios: Zero-shot, Few-shot, and Full-shot, and observe how the different models fare under each setting.
Zero-shot Performance: The X-Phishing-Writer exhibits a marked superiority over baseline models, as depicted in Figures 5(A) to 5(D). It shines in terms of BLEU and ROUGE scores, particularly demonstrating a notable lead in D-ACC score, with an impressive margin of up to 78.33 points. In contrast, mBART (Naive) shows lackluster performance in BS evaluation (Figure 5(E)), highlighting challenges in maintaining fluency and readability. These findings underscore the limitations of naive approaches in adapting to the K2T task and highlight the crucial role of sophisticated knowledge transfer mechanisms in achieving effective cross-lingual generalization, as evidenced by mBART-K2T’s suboptimal Zero-shot transfer capability. The Zero-shot scenario stresses the models’ ability to generalize without target language training data, emphasizing the significance of adept pre-training and transfer learning methodologies in NLP tasks.
Few-shot Setting: With the introduction of a modest number of target language examples, X-Phishing-Writer maintains its lead against baselines in key metrics (Figures 5(A) to 5(D)), notably in D-ACC (Figure 5(D)). This underscores its capability not only in cross-lingual tasks but also in generating quality phishing e-mails with limited target language data, validating the effectiveness of its learning approach.
Full-shot Analysis: Upon integrating the complete dataset of target language e-mails, mBART-K2T shows improvement in D-ACC (Figure 5(D)) but a decline in other metrics. This suggests a tendency toward generating more varied content, which, while indicating adaptability, may diverge from the intended phishing e-mail characteristics. Conversely, X-Phishing-Writer, with its Adapter-based transfer learning, consistently aligns closely with the target examples, producing high-quality phishing e-mails. This scenario highlights X-Phishing-Writer’s robust architecture and its ability to maintain fidelity to the original intent across diverse linguistic contexts, showcasing its practicality for cross-lingual phishing e-mail generation.
4.4.2 Effects of Varying the Few-shot Setting Size.
To observe the impact of training data volume on the performance of X-Phishing-Writer, we varied the size of the training set, introducing data in increments of 500 training instances to the model for training. This setting allowed for a nuanced comparison of X-Phishing-Writer with the mBART-K2T model across settings: Zero-shot, Few-shot, and Full-shot.
Zero-shot: Within the Zero-shot setting, X-Phishing-Writer consistently outperforms mBART-K2T, as detailed in Figures 6(A) to 6(D). Particularly noteworthy is its proficiency in the BS metric (Figure 6(E)), highlighting superior fluency and readability with scores reaching up to 60% without any target language data. Conversely, mBART-K2T struggles significantly in both BS and D-ACC metrics (Figures 6(D) to 6(F)), indicating a deficiency in generating coherent and recognizable phishing content.
Few-shot Scenario: Shifting focus to the Few-shot setting, where a limited set of target language examples is introduced, X-Phishing-Writer demonstrates a remarkable capacity to maintain high-quality output regardless of the data volume provided (Figures 6(A) to 6(D)). This contrasts with mBART-K2T, whose performance is significantly more sensitive to the amount of training data.
Full-shot Scenario: In the Full-shot context, where the model has access to an extensive set of 3,000 target language samples, mBART-K2T shows temporary improvements over X-Phishing-Writer for certain data quantities. However, this advantage does not hold consistently, with mBART-K2T’s performance experiencing a considerable decline as demonstrated in Figure 5. This suggests that, despite initial gains, mBART-K2T fails to sustain high performance levels across all sample sizes, in contrast to X-Phishing-Writer’s more stable and robust output across varying training volumes.
In conclusion, our study underscores the significant impact of employing Adapters on model performance in diverse settings, with a special focus on cross-lingual task transferability. X-Phishing-Writer, through its innovative use of Adapters, showcases exceptional capability in generating high-quality, robust cross-lingual phishing e-mails. These results provide critical insights into improving the design and functionality of NLP models, emphasizing the effectiveness of tailored transfer learning approaches.
Future avenues of research could explore further optimization of transfer learning strategies, particularly in enhancing model adaptability across various linguistic and task-specific contexts. By pushing the boundaries of current methodologies, we aim to broaden the scope of applications for NLP models, ensuring more versatile and impactful deployments in real-world scenarios.
4.4.3 Performance Comparison with mmT5-Adapted.
In this subsection, we present a performance comparison of X-Phishing-Writer against mmT5-Adapted under various experimental settings. For a fair comparison, we use mT5 as the base model for constructing X-Phishing-Writer. Figure 7 shows the performance comparison between X-Phishing-Writer (mT5) and mmT5-Adapted.
In the Zero-shot setting, X-Phishing-Writer notably surpasses mmT5-Adapted, underscoring our framework’s proficiency in managing low-resource conditions and its ability to produce high-quality text absent target language training data. This outcome emphasizes the framework’s adaptability to zero-source languages, reinforcing its suitability for tasks with limited linguistic resources.
Transitioning to the Few-shot and Full-shot settings, our evaluation focuses on generation quality and D-ACC. Both X-Phishing-Writer and mmT5-Adapted exhibit comparable capabilities in generating high-quality textual content. Nevertheless, mmT5-Adapted tends to outperform in scenarios where ample target language data is available, highlighting its efficiency in data-rich environments. However, as evidenced in Figure 7(F), X-Phishing-Writer demonstrates a significant advantage in D-ACC, showcasing its superior precision in generating domain-specific text.
In conclusion, our experimental findings validate X-Phishing-Writer’s performance in low-resource scenarios and its precision in domain-specific text generation. We believe that the performance of X-Phishing-Writer can be attributed to its design choice of utilizing shared GLA embeddings, in contrast to mmT5’s approach of employing separate adapters for different languages. This design difference significantly enhances our model’s efficiency in the zero-shot scenario, which is especially pertinent for the cross-lingual generation of phishing emails. Given the challenge of acquiring training data for phishing emails across multiple languages, the zero-shot capability emerges as a critical feature.
4.4.4 Performance on Other Language Settings.
To ascertain the efficacy of our X-Phishing-Writer, we extend our evaluation to encompass the full set of 25 languages, utilizing ROUGE-L and BERTScore metrics for a comprehensive assessment. The results of this analysis are shown in the Appendix.
Tables 12 through 14 detail the ROUGE-L scores, showcasing X-Phishing-Writer’s adeptness at generating phishing emails across a broad linguistic spectrum. Correspondingly, Tables 15 through 17 illustrate the BERTScore outcomes, further corroborating the model’s commendable performance across diverse languages.
Notwithstanding its overall success, X-Phishing-Writer encounters challenges with certain low-resource languages, notably Korean and Japanese, where its performance dips. This diminished effectiveness is posited to stem from the inherent limitations of the mBART model in processing these languages, which diverge significantly in grammar, vocabulary, and other linguistic features from languages more closely aligned with English. Such discrepancies underscore the complex nature of cross-lingual NLP and the critical importance of language-specific considerations in model development. Addressing these linguistic variations demands a focused approach to incorporating language-specific attributes into the model, aiming to elevate cross-lingual generation quality.
Our evaluation reveals that X-Phishing-Writer excels particularly in low-resource language contexts, affirming its utility in navigating the intricacies of linguistic diversity. Nonetheless, it becomes apparent that the model’s advantages are less distinct when dealing with languages linguistically similar to English. This observation underscores the nuanced challenges of cross-lingual NLP and highlights areas for future enhancement and research, particularly in refining the model’s adaptability to a wider range of language families.
4.5 Result on Simulated Social Engineering Testing
We adopt psychological principles proposed by Ferreira and Lenzini [2015] and also refer to the experimental results from Lin et al. [2019] concerning young and older subjects. According to Lin et al. [2019], young users are more susceptible to authority and scarcity issues, while older users are more influenced by authority and reciprocity issues.
We monitor two key metrics: e-mail open rate and click-through rate (CTR). The e-mail open rate represents the percentage of recipients who open the e-mail, serving as a measure of the effectiveness of the subject line. The CTR measures how many recipients clicked on hyperlinks within the e-mail content. Since the CTR indicates the percentage of recipients who clicked the e-mail, it illustrates, over time, what portion of the audience remains engaged with the e-mail content.
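Both metrics are simple ratios over the recipient pool, as sketched below. The counts are illustrative values sized like the younger group, not measurements from our experiments.

```python
# Open rate and CTR as ratios over the recipient pool. The recipient count
# mirrors the younger group (830); the open/click counts are invented
# illustrations, not the study's measured data.
def open_rate(opens, recipients):
    return opens / recipients

def click_through_rate(clicks, recipients):
    return clicks / recipients

recipients = 830
opens, clicks = 249, 83
rate_open = open_rate(opens, recipients)             # roughly 0.30
rate_click = click_through_rate(clicks, recipients)  # roughly 0.10
```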
In Table 5, we present the results of our social engineering experiments targeting different demographics. For the young group of 830 subjects, we sent e-mails involving authority-related topics, resulting in an e-mail open rate of 0.30 and a CTR of 0.10. Regarding scarcity issues, we obtained an e-mail open rate of 0.28 and a CTR of 0.05. In comparison, for the authority-related e-mails sent to 267 staff members and 585 teachers, the staff’s e-mail open rate was 0.13 and the teachers’ was 0.19, with an average of 0.16. However, concerning CTR, staff members had 0.46, while teachers had 0, resulting in an average of 0.23. For reciprocity issues, we achieved e-mail open rates of 0.05 and 0.11, averaging 0.08. The CTR was 0.19 for staff members and 0 for teachers, averaging 0.095.
It is worth noting that we observed CTRs higher than e-mail open rates in some cases. This is likely due to Gmail’s protective mechanisms against monitoring e-mail open behavior, which can prevent accurate detection of e-mail openings; hyperlink clicks, by contrast, can be monitored precisely.
From the above discussion, it is evident that employing psychological principles in social engineering experiments across different demographics yielded significant effects. Particularly, exceptional results were obtained in both control groups involving authority-related scenarios. This aligns with the findings of Lin et al. [2019], verifying the value of using generative e-mails for social engineering training. Especially in the face of possible upcoming threats, organizations should adopt our proposed generative framework to swiftly and securely establish their social engineering training programs.
Lastly, we also observed that teachers demonstrated a higher level of defense in terms of CTR. This could be attributed to their higher education, making them more sensitive to language and better at identifying potential flaws in the grammar or wording of the generated phishing e-mails. This discovery offers valuable clues for future research on strategies to educate and train individuals to counter social engineering.
4.6 Ablation Studies
In this subsection, we present the results of examining how pre-training methods and adapters affect the performance of X-Phishing-Writer. The first set of experiments focuses on evaluating the benefits of adopting the generative pre-training for zero-shot cross-lingual generation capabilities. The second set of experiments explores the advantages brought by various adapters in X-Phishing-Writer, aiming to assess the impact of omitting adapters on model performance.
4.6.1 The Impact of Task-specific Pre-training on X-Phishing-Writer.
To ascertain the benefits of task-specific pre-training on the zero-shot cross-lingual generation capabilities of our model, we conducted a series of ablation experiments under various data conditions. These experiments aimed to compare the effectiveness of task-specific pre-training against a denoising sequence-to-sequence (Seq2Seq) pre-training approach. The outcomes of these experiments offer insights into the pre-training methods that most significantly enhance phishing e-mail generation.
Zero-shot Setting: In this initial scenario, our evaluation reveals that models employing task-specific pre-training exhibit superior performance in both D-ACC and BERTScore metrics, as detailed in Table 6. This improvement underscores the pivotal role of task-specific pre-training in boosting the model’s proficiency in phishing e-mail generation across languages.
Few-shot Setting: Extending our analysis to the Few-shot setting, we observe nuanced performance differences between the two pre-training strategies. Despite task-specific pre-training slightly lagging behind in ROUGE metrics, it showcases enhanced outcomes in BLEU-1, BERTScore, and D-ACC score, as shown in Table 7. This pattern suggests that task-specific pre-training maintains its effectiveness, even when the model is exposed to a modest amount of target language data, thereby improving the model’s performance significantly.
Full-shot Setting: When the model is trained with a comprehensive dataset (Full-shot setting), task-specific pre-training continues to show an upward trend in performance across most metrics. As reported in Table 8, though it marginally trails behind the denoising Seq2Seq pre-training in D-ACC, the observed differences are believed to be within a negligible range. This outcome highlights the capacity of task-specific pre-training to substantially enhance the model’s domain-specific generation capabilities given ample training data.
In conclusion, the results from our ablation studies clearly demonstrate the efficacy of task-specific pre-training in improving the performance of X-Phishing-Writer under varied data conditions. This is evident in metrics evaluating text fluency and domain specificity. These findings reinforce the critical importance of incorporating task-specific pre-training in NLG tasks, paving the way for more nuanced and effective model training approaches.
4.6.2 The Impact of Adapters on X-Phishing-Writer.
This subsection presents the results of ablation studies conducted to assess the role of various adapters in X-Phishing-Writer’s performance. Through these experiments, we aim to elucidate the contributions of individual adapters to the model’s effectiveness in phishing e-mail generation across languages.
Zero-shot Setting: The findings, presented in Table 9, highlight the pivotal role of adapters:
— Removing Inverse Adapter: Exclusion of the Inverse Adapter significantly hampered the model’s performance, indicating its crucial role in effectively transferring knowledge from English to Chinese phishing e-mails; without it, the quality of generated Chinese phishing e-mails degraded markedly.
— Removing Generative Language Adapter: The removal of the GLA led to a notable drop in D-ACC, underscoring its importance in providing language-specific representations and thereby enhancing cross-lingual content generation capabilities. The marked performance decline post-GLA removal rendered the effects of omitting other adapters less discernible.
Few-shot and Full-shot Settings: In scenarios where target language text is available, from a limited amount up to the full set (as shown in Tables 10 and 11), we observed:
— Removing Inverse Adapter: Interestingly, model performance improved upon the removal of the Inverse Adapter in settings with sufficient target language data. This suggests that while the model benefits from direct exposure to target language information, the transformation process via the Inverse Adapter might inadvertently diminish language representation fidelity. Nonetheless, the Inverse Adapter’s presence still positively impacts the D-ACC metric, highlighting its utility in aligning the model more closely with domain-specific requirements.
— Removing Generative Language Adapter and Phishing Task Adapter: The absence of either the GLA or the PTA adversely affects the model’s D-ACC, reaffirming their critical roles in bolstering cross-lingual representation and domain-specific learning capabilities, respectively.
— Removing Task-specific Pre-training: Eliminating task-specific pre-training results in a general performance downturn, underscoring the significance of this preparatory step in enhancing the overall efficacy of the model.
These ablation studies collectively demonstrate the contributions of the Inverse Adapter, GLA, PTA, and task-specific pre-training to the X-Phishing-Writer’s performance. Notably, the adapters’ influence varies across different settings, emphasizing the need for a nuanced understanding of their roles in cross-lingual and domain-specific NLP tasks.