Assessing AI vs Human-Authored Spear Phishing SMS Attacks:
An Empirical Study Using the TRAPD Method

Jerson Francia2, Derek Hansen2, Ben Schooley2, Matthew Taylor2, Shydra Murray2 and Greg Snow3
2Department of Electrical and Computer Engineering
3Department of Statistics
Brigham Young University, Provo, Utah 84602
Email: 2{jersno, dlhansen, ben_schooley, mtaylo48, shywilli}@byu.edu
Abstract

This paper explores the rising concern of utilizing Large Language Models (LLMs) in spear phishing message generation, and their performance compared to human-authored counterparts. Our pilot study compares the effectiveness of smishing (SMS phishing) messages created by GPT-4 and human authors, which have been personalized to willing targets. The targets assessed the messages in a modified ranked-order experiment using a novel methodology we call TRAPD (Threshold Ranking Approach for Personalized Deception). Specifically, targets provide personal information (job title and location, hobby, item purchased online), spear smishing messages are created using this information by humans and GPT-4, targets are invited back to rank-order 12 messages from most to least convincing (and identify which they would click on), and they are then asked questions about why they ranked messages the way they did. They also guess which messages were created by an LLM and explain their reasoning. Results from 25 targets show that LLM-generated messages are most often perceived as more convincing than those authored by humans, with messages related to jobs being the most convincing. We characterize different criteria used when assessing the authenticity of messages including word choice, style, and personal relevance. Results also show that targets were unable to identify whether a message was AI-generated or human-authored and struggled to identify criteria to use in order to make this distinction. This study aims to highlight the urgent need for further research and improved countermeasures against personalized AI-enabled social engineering attacks.

1 Introduction

In today’s digital landscape, cybersecurity defenders and adversaries continuously adapt to new and emerging technologies, many of which create new cyber risks. While implementing robust security measures can significantly reduce risk posed by potential threat actors, the strength of a system is dependent upon its users [1]. Among the myriad end user vulnerabilities, and tactics employed by malicious actors, phishing remains the most common way of infiltrating systems. Phishing is a social engineering tactic that persuades victims to take an action (e.g., click on a malicious link or email attachment) that causes malicious code to run or discloses sensitive information. Phishing attacks usually take the form of email messages, telephone calls (vishing) or SMS messages (smishing) [2]. Phishing attacks are by far the most numerous, with the FBI Internet Crime Complaint Center reporting five times more phishing complaints in 2023 than any other attack category, totaling over $300 million in reported complaint losses from 2021 to 2023 [3].

Spear phishing is a targeted form of phishing that involves the use of more personalized messages utilizing specific information about a user (e.g., name, job title, home address) to make the message more believable, thus making it more difficult to differentiate from legitimate sources [4]. Targeted spear phishing poses a significant threat for the future as the method continues to grow. The Proofpoint “State of the Phish 2023” report noted that spear phishing prevalence represented approximately 74% of attacks in 2022, as opposed to bulk phishing at 8% [5]. In instances where the pretext used against a victim matches their expectation, the attacker is likely to succeed [6]. For example, when a user expects to receive a product purchase and shipping email from a large online retailer and receives a phishing email claiming to provide a shipping update, the user is likely to fall victim [7].

Traditionally, crafting these targeted messages was less common, requiring significant time and effort on the part of the cybercriminals to carry out research and then craft specifically targeted (rather than generic) messages [7]. However, with the advancement of AI, emergent large language models (LLMs) could potentially be used to carry out spear phishing campaigns with greater efficiency and human-like accuracy [8]. This recent shift in the attack landscape necessitates a deeper understanding of the current capabilities of malicious actors for utilizing AI to conduct cyber attacks.

This study aims to investigate the potential of using state-of-the-art language AI in the context of spear phishing. There is a need to assess how current models fare in generating malicious messages for distribution to unknowing victims, and whether they currently perform on par or better than human counterparts. We also assess whether certain factors in a spear phishing message, such as personalization, tone, and word choice, contribute significantly to its effectiveness.

The implications of this research are twofold. First, awareness about the malicious capabilities of current technologies can equip cybersecurity professionals with knowledge and understanding on how to safeguard against the risks posed by AI-generated spear phishing attacks. Second, through a deeper understanding of themes and elements associated with more deceptive spear phishing attacks, cybersecurity education can evolve in tandem with emerging threats.

2 Review of Related Literature

The body of literature related to spear phishing is expanding, with numerous studies investigating the factors that lead individuals to fall for phishing attacks [6, 9, 1, 10], as well as the application of AI for detecting and generating such attacks [8, 11].

Phishing, like other forms of social engineering, relies on various principles known as weapons of influence to manipulate targets into performing actions that benefit the attacker [9, 12]. These principles include reciprocation, which appeals to the target’s desire to repay favors (e.g., by offering a gift card); liking, which leverages the users’ tendency to comply with requests from people they like or perceive similarities with; scarcity, which exploits the target’s sense of urgency in response to limited resources or availability; social proof, which utilizes peer pressure to encourage compliance; authority, which appeals to the user’s inclination to follow authoritative entities, such as government agencies or law enforcement; commitment, which pressures individuals into adhering to their perceived obligations; and perceptual contrast, which presents two distinct scenarios and prompts the user to choose the more appealing option. Malicious messages often employ these principles to exploit potential victims.

As communication methods evolve, so too does the medium of phishing. Initially, phishing primarily relied on email as its mode of communication. Recently, however, phishing campaigns have begun to adopt alternative platforms such as SMS [13] and social media [14]. There is comparatively less literature available on these more specific modes of phishing attacks.

Several studies have explored the potential of applying machine learning (ML) for generating phishing attacks. Seymour and Tully employed a neural network trained on social media posts to create spear phishing messages [11], while Khan et al. utilized the now-outdated GPT-2 model to generate phishing attacks [15]. These studies, which tested their ML phishing attacks against humans, achieved varying degrees of success. Advancements in AI have significantly improved the capability of large language models to generate human-like text. Models such as GPT-4 have demonstrated an unprecedented ability to produce coherent, relevant, and persuasive content, which can potentially be leveraged for malicious purposes [8, 16]. However, literature specifically addressing spear phishing attacks generated by LLMs remains limited, as research on using LLMs for crafting phishing messages has only recently emerged.

Comparing the effectiveness of human-generated versus AI-generated phishing attacks is a relatively unexplored area of research, yet one of growing importance. While humans can craft highly targeted and convincing phishing messages, AI-generated messages can achieve similar success rates with great potential to scale [8, 11]. As a result, studies that aim to increase education about AI manipulation are emerging [17].

With the scarcity of literature on spear phishing generated by LLMs, research to reach a better understanding of these risks is essential. While many studies offer updated training methods to inform employees and end-users about phishing attacks [18], this study aligns with these efforts while also addressing the emerging threat of LLM-generated spear phishing. By evaluating how effectively publicly available LLMs can generate convincing spear phishing attacks, we can emphasize the need for countermeasures against potential malicious AI use. This study aims to bridge the gap in the existing literature and provide insights that can inform the development of more robust cybersecurity protocols.

3 Research Questions

We posit four research questions that this study aims to address to form a comprehensive understanding of the capabilities of LLM generated spear phishing SMS messages compared to human capabilities. We hereafter refer to the research questions in this paper as RQ1, RQ2, RQ3 and RQ4:

  • RQ1. Are spear phishing SMS messages created by AI more convincing than those created by humans?

  • RQ2. What content characteristics contribute to a more convincing spear phishing message?

  • RQ3. Can people differentiate AI-generated spear phishing SMS messages from those generated by humans?

  • RQ4. What criteria do people use when identifying AI-generated spear phishing messages?

4 Methodology

This study comprises multiple steps aimed at simulating spear phishing attacks on human subjects using a novel methodology we call TRAPD, which stands for Threshold Ranking Approach for Personalized Deception. TRAPD is designed to analyze the perceived effectiveness of deceptive messages (e.g., spear smishing messages) that have been personalized to a specific target in an ethical manner. For this study, spear phishing messages were created and targeted to each subject, who then assessed whether the message would have been successful or not in a real life setting. First, personal information was gathered from willing human subjects. The anonymized personal information was then passed to human authors who crafted spear phishing messages specific to each human target. The same personal information was also used to create prompts to feed to the GPT-4 model to generate spear phishing messages. Both the human-made and the AI-made messages were then shown to each participant who rank ordered them based on their persuasiveness and evaluated each spear phishing “attack” based on whether it would 1) trick them, and 2) whether the message was made by a human or AI. The below section describes the phases of the TRAPD methodology that were conducted.

4.0 The TRAPD Methodology

One of the contributions of this paper is the introduction of the TRAPD methodology for evaluating personalized deceptive messages. While the focus of this paper has been on spear phishing SMS messages, we believe the methodology that we performed could be used to evaluate other types of deceptive content tailored to individuals (e.g., personalized disinformation, other types of spear phishing messages). We feel that TRAPD can provide valuable and unique insights. The methodology is designed to provide an ethical way to create and evaluate personalized deceptive messages. It was also designed to facilitate comparison of different messages, such as those created by humans versus AI, or those created about different topics.

At its core, the TRAPD methodology includes the following steps:

  1. Recruiting targets willing to share personal information with potential attackers (humans or AI).

  2. Creating personalized deceptive messages aimed at the targets.

  3. Having targets rank order (sort) the messages from most compelling to least compelling and choose a threshold above which they would be deceived.

  4. Having targets provide qualitative assessments of their rationale for sorting messages.

  5. (Optionally) Having targets “label” messages with a variable of interest (e.g., whether they believe a message was created by AI or not) and then provide qualitative assessments explaining their labeling.

The following subsections explain how we implemented the TRAPD methodology in this particular study. We hope it will serve as a template for future studies that use the method to evaluate other types of personalized deceptive content.

4.1 Recruiting targets

The first phase of research comprised recruiting participants as potential targets for spear phishing attacks. Recruiting included convenience sampling from people whom the researchers know, and using printed flyers and online banner advertising that directed interested individuals to complete a Qualtrics survey to sign up, consent to participate, and share some personal information for the purposes of this study. Participants were asked to participate in an experiment and post experiment interviews to provide detailed feedback on the spear phishing messages that were crafted for them. Participants were not told at recruitment how the spear phishing messages would be generated (i.e., human vs AI).

Participants were asked to share their name and email address for study contacting purposes. In addition, gender, age group, and self-reported technical ability were collected for study analysis. Additional information specific to each participant was also requested in order to generate personal references in the targeted spear phishing messages. This information included the participant’s hobbies, the city or state in which they live, their occupation and workplace, and something from their home that they recently posted on social media.

After the sign-up period, we had 41 candidate participants who were all sent invitations. Twenty-five responded with complete registration information and arrived on their scheduled interview day. The participant demographics are shown in Table I.

TABLE I: Target Demographics
Gender       Targets
Male         13 (52%)
Female       12 (48%)
Age Group    Targets
18-25        9 (36%)
26-40        10 (40%)
40+          6 (24%)
Figure 1: The two phases for the target interview. (a) Threshold Rank Ordering. (b) Label as Human vs AI.

4.2 Creating personalized deceptive messages

Using the personal data gathered from each participant, spear phishing messages were generated from both the GPT-4 model and from human authors. The goal of each message was to convince the human target to click the link included in each message. The spear phishing messages that were generated were in the form of SMS. For each target, we formed three different prompts, based on the information gathered during recruitment. Each prompt focused on one of three topics: the target’s hobbies, workplace, or something they own that they recently posted on social media. Below is an example of the prompt for one of the targets (name altered):

Make a spearphishing message in the form of an SMS for Ethan Clark, a 18-25 year old Male who recently shared about their pepper plants; they would post pictures of how they are flowering and growing peppers. Have them click a link that is not from a legitimate source, but is believable.

This phase of work was conducted in two sub-phases, one for the human authors and one for GPT-4. The sub-phases below were conducted in parallel.

4.2.1 Human Generation

The human authors were undergraduate students enrolled in a university cybersecurity program during the Winter Semester 2023. Students generated spear phishing messages based on the data provided to them about their target, though names were changed to maintain participant privacy. Messages were entered into an online survey along with some demographic data about the authors including name, gender, age group, and technical ability. Authors were also asked to assess their understanding and experience relative to spear phishing. Authors were then provided with four prompts containing information about their target and then were asked to generate a message that would attempt to phish that target based on that information. After each prompt, the author was asked to assess their confidence in the ability of the message to trick the target.

Ninety-nine students participated in the survey, with the intent of creating a balanced pool of message contributions across demographics. In total, 297 messages were gathered and screened for invalid entries that were not spear phishing messages. For messages that had no link, or had placeholders for links, we modified the message to include a generic tinyurl. In cases where there were more than two valid messages for each topic (work, hobbies, social) for each target, two were selected at random to proceed to the next phase. 246 human-authored messages were retained and prepared for the next phase.

4.2.2 AI Generation

A script was written to call the GPT-4 API to generate spear phishing message outputs automatically. The same three prompts given to the student authors for each target were fed to the script to generate the AI-generated spear phishing messages. An example of a GPT-4 output is shown below:

Hey Stephanie! I came across this incredible home organization app that I think will help you streamline your daily tasks and save time. It’s been a game-changer for me! They are currently giving away a free 1-year subscription to the first 100 users who sign up. Don’t miss out on this opportunity! Here’s the link: bit.ly/organizehome4u

Stay organized!

Your friend :)

Since GPT-4 outputs are not completely reproducible, three different responses were gathered from each prompt. This resulted in 9 messages generated by the model for each target. Similar to what was done with the human-authored messages, we selected at random 2 messages for each topic, totaling 246 AI-generated messages that were used for the next phase.

After both sub-phases were completed, each target had 12 spear phishing messages: 6 generated by GPT-4 and 6 written by humans. By the end of this phase, 492 simulated spear phishing “attacks” were prepared for assessment.
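The random selection step can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `select_per_topic`, the topic keys, the placeholder message strings, and the fixed seed are all assumptions made for the example.

```python
import random


def select_per_topic(candidates, per_topic=2, seed=0):
    """Pick up to `per_topic` messages at random for each topic.

    `candidates` maps a topic name to its list of valid messages.
    """
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    return {
        topic: rng.sample(messages, min(per_topic, len(messages)))
        for topic, messages in candidates.items()
    }


# Three hypothetical outputs per topic for one target (9 in total).
pool = {
    topic: [f"{topic}-{i}" for i in range(3)]
    for topic in ("job", "hobby", "social")
}
chosen = select_per_topic(pool)

# 41 targets x 3 topics x 2 messages matches the 246 messages per source.
total_per_source = 41 * sum(len(msgs) for msgs in chosen.values())
```

With two messages kept per topic from each source, every target ends up with 6 AI-generated and 6 human-authored messages.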

Figure 2: AI vs Human performance results (Rank & Click Probability). (a) Rank Distribution of AI vs Human Messages. (b) Probability distribution of AI performance.

4.3 Target Interview

Of the 41 targets that signed up to participate, 25 were able to voluntarily come for the target interview. In this interview, targets were shown the personalized spear phishing messages from the message generation phase. At this point, they were not informed that some messages were AI-generated. Each participant consented to be recorded for future analysis. We break down this phase according to the steps in the TRAPD methodology:

4.3.1 Threshold rank order

The 12 targeted messages were shown to each target during the interview, with each message displayed on a separate printed piece of paper. Participants were asked to arrange the messages, ranking them on their ability to convince them to click on the link provided. Figure 1a shows an example of how the targets laid out the messages in the experiment, illustrating the ranked order from most to least convincing, as well as the threshold where they would have started to click the link based on the participants’ assessments.
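One way to record a target's threshold ranking for later analysis is sketched below. The class name, field names, and message identifiers are hypothetical and chosen only for illustration; the sketch assumes rank 1 is the most convincing message and that the threshold is stored as the number of top-ranked messages the target would click.

```python
from dataclasses import dataclass


@dataclass
class ThresholdRanking:
    """One target's sorted messages plus their click threshold (hypothetical layout)."""
    target_id: str
    ordered_ids: list[str]  # message IDs, most convincing first
    threshold: int          # number of top-ranked messages the target would click


def click_labels(ranking):
    """Map each message ID to (rank, would_click); rank 1 is most convincing."""
    return {
        msg_id: (rank, rank <= ranking.threshold)
        for rank, msg_id in enumerate(ranking.ordered_ids, start=1)
    }


# Shortened example with four made-up message IDs instead of twelve.
example = ThresholdRanking("T00", ["m07", "m02", "m11", "m05"], threshold=2)
labels = click_labels(example)
```

Each message then carries both a rank (used for the ranking comparisons) and a binary click outcome (used for the logistic models).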

4.3.2 Qualitative Assessment

Participants were then asked what elements in the messages led them to be tricked, as well as other feedback that they were willing to share regarding the message. They were also asked what made them more cautious of the less convincing messages.

4.3.3 Label as Human vs AI

Participants were then notified that one or more of the messages were created by an AI, and were asked to identify which messages they believed to be human-made vs computer generated. Figure 1b shows an example of how the targets accomplished this, by placing AI “markers” on the messages that they believed to be AI-generated. Similar to the previous part of the interview, they were then asked what made them think the messages they chose were made by humans or AI. At the target’s request, researchers revealed which messages were made by AI and human authors.

Audio recordings were made for all 25 interviews which were then assimilated for transcription and further analysis.

4.4 Statistical Methodology

Several statistical analyses were conducted to evaluate the difference in performance of AI-generated and human-authored spear phishing messages (RQ1), the difference in performance of particular elements in the spear phishing messages (RQ2), the overall accuracy of the targets in identifying AI-generated messages (RQ3), and whether certain factors in the messages helped them identify AI-generated messages better (RQ4).

To address RQ1, we began by comparing the average ranking of both types of messages across all the targets. To verify statistical significance, we utilized a permutation test to validate our findings. Following this, we applied a logistic regression model to predict the likelihood of a subject clicking on a link based on whether the message was AI or human. To account for potential correlations within subjects and authors, we fitted a generalized linear mixed-effects model (GLMM). To assess how similar or better the performance of AI messages is compared to human messages, we also employed a Bayesian logistic model, using a horseshoe prior on the log odds ratio, pulling estimates towards zero when evidence was weak but not excessively when evidence was strong.
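A permutation test on ranks can be sketched as follows. The text does not specify the exact permutation scheme, so this example assumes source labels are shuffled within each target (which preserves each target's 6 AI and 6 human messages); the function names and the toy data are made up for illustration.

```python
import random
from statistics import mean


def mean_rank_gap(data):
    """Mean AI rank minus mean human rank over all (rank, source) pairs."""
    ai = [r for target in data for r, s in target if s == "ai"]
    human = [r for target in data for r, s in target if s == "human"]
    return mean(ai) - mean(human)


def permutation_p_value(data, n_perm=2000, seed=0):
    """Two-sided p-value, shuffling source labels within each target."""
    rng = random.Random(seed)
    observed = abs(mean_rank_gap(data))
    hits = 0
    for _ in range(n_perm):
        shuffled = []
        for target in data:
            labels = [s for _, s in target]
            rng.shuffle(labels)
            shuffled.append([(r, s) for (r, _), s in zip(target, labels)])
        if abs(mean_rank_gap(shuffled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Made-up data: 5 targets, each ranking 6 AI and 6 human messages (ranks 1..12).
toy_rng = random.Random(1)
toy = []
for _ in range(5):
    sources = ["ai"] * 6 + ["human"] * 6
    toy_rng.shuffle(sources)
    toy.append(list(zip(range(1, 13), sources)))

p_value = permutation_p_value(toy)
```

A large p-value here would mean the observed gap in mean ranks is typical of what random labeling produces.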

To address RQ2, we compared the average rankings of the messages according to each of the three topics: Job, Hobby or Social. We also used a logistic regression model in relation to the three topics to assess whether a topic was more successful in generating clicks than the others.

To address RQ3, we tallied the correct and incorrect predictions for all targets across AI and human messages. We also developed a predictive model to assess whether subjects could correctly identify the origin of the messages based on factors (RQ4) such as the presence of emojis (Emoji), modifications to links (LinkMod), and the total number of characters in the message (CharacterCount). This is to determine if these features could significantly aid subjects in distinguishing between AI and human messages.
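The three predictors could be derived per message roughly as in the sketch below. This is an assumption-laden illustration: the function name is hypothetical, `LinkMod` is passed in as a flag because it reflects a researcher's edit rather than anything visible in the text, and emoji detection uses the Unicode category "So" (other symbol) as a rough heuristic that also matches some non-emoji symbols.

```python
import unicodedata


def message_features(text, link_modified):
    """Return the Emoji, LinkMod, and CharacterCount features for one message."""
    # Heuristic: most emoji fall in Unicode general category "So".
    has_emoji = any(unicodedata.category(ch) == "So" for ch in text)
    return {
        "Emoji": has_emoji,
        "LinkMod": bool(link_modified),
        "CharacterCount": len(text),
    }


with_emoji = message_features("Hi \U0001F600", link_modified=False)
text_smiley = message_features("Stay organized! :)", link_modified=True)
```

Note that a typed smiley such as ":)" is plain punctuation and is not counted as an emoji under this heuristic.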

5 Results

This section describes the statistical and qualitative results that address our research questions.

5.1 Statistical Analysis

In this section we report the statistical results from the models described in the methodology. The statistical models and methods used were applied to data gathered during the target recruitment and interview phases.

5.1.1 AI vs. Human Ranking and Click Probability

To assess performance between AI and human messages (RQ1), we compared their performance based on how high they were ranked in the target interviews and how often each type of message would be clicked by our targets. On average, AI-generated messages ranked slightly higher (mean rank 6.41, where rank 1 is the most convincing) than human-authored messages (6.58). Figure 2a describes the distribution of the AI and human messages across each rank, as well as their averages.

Results from our statistical analysis indicated no significant difference between the two groups, with a p-value of 0.665. To account for the non-normal distribution of ranks and the variation across targets, a permutation test was conducted, but it again showed no significant difference in ranks between AI and human-generated messages.

The logistic regression model came with similar results. The model did not find a significant difference, with a p-value of 0.182. The predicted probabilities of clicking were 21.3% for human-generated messages and 28.0% for AI-generated messages. Although the AI-generated messages had a higher predicted click rate, this difference was not statistically significant. Fitting the GLMM also yielded similar results, indicating greater variation between subjects than between authors.

The odds ratio for clicking on a link in an AI-generated message versus a human-authored one was 1.43, with a 95% confidence interval of 0.847 to 2.446. This wide interval includes 1, indicating that the increased likelihood (43%) of clicking on AI-generated messages compared to human-authored ones is not statistically significant and could range from 15% lower to 145% higher.
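The percentages in this interpretation follow directly from the odds ratio and its interval bounds. The short sketch below reproduces the arithmetic; the helper name `percent_change` is an assumption for illustration.

```python
def percent_change(odds_ratio):
    """Percent change in odds implied by an odds ratio (negative means lower)."""
    return (odds_ratio - 1.0) * 100.0


# Odds ratio and 95% confidence bounds reported in the text.
estimate, lower, upper = 1.43, 0.847, 2.446

summary = {
    "estimate": round(percent_change(estimate), 1),
    "lower": round(percent_change(lower), 1),
    "upper": round(percent_change(upper), 1),
}

# An interval that contains 1 is compatible with no difference in odds.
includes_one = lower < 1.0 < upper
```

The bounds correspond to roughly 15% lower and 145% higher odds, and the interval straddles 1.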

We employed a Bayesian logistic model to further the analysis. Using a horseshoe prior on the log odds ratio, we obtained a posterior mean odds ratio of 1.22 (95% CI: 0.86 to 2.05) with a median posterior of 1.16. From this posterior distribution, we estimated that the probability of the odds ratio being less than 1 (indicating humans are better than AI) is 19.7%, while the probability of it being greater than 1 (indicating AI is better than humans) is 80.3%. The probability of the odds ratio exceeding 1.5 (AI being more than 50% better than humans at generating clicks) is only 19.9%. A plot of the posterior distribution of the odds ratio shown in Figure 2b indicates that, while there is a long right tail suggesting a possibility of AI being significantly better, most of the probability mass is near 1, suggesting neither being definitively better.

Figure 3: Rank Distribution of Each Topic

5.1.2 Topic-based Ranking & Click Probability

One of the key content characteristics of the messages (RQ2) relates to their topic: Job, Hobby, or Social media post. On average, Job ranked the highest (5.71), followed by Hobby (6.66) with Social ranking the lowest (7.13). The distribution of these rankings is shown in Figure 3. Using a Tukey post-hoc test to analyze statistical significance of the ranking of each topic, we derived the confidence intervals of each pair of topics. Results show that Job ranked significantly higher than Social, while the Job-Hobby difference was not statistically significant and the Hobby-Social difference even less so.

We used a logistic regression model to determine if the topic of the message influenced the likelihood of clicking. The model showed an overall p-value of 0.0009, suggesting that the topic played a significant role in generating clicks. Job-related messages showed the highest probability of clicking (38%) while Hobby-related (19%) and Social-related (17%) messages scored much lower. Pairwise comparisons showed that Job had a significantly higher click probability compared to Hobby and Social, with no significant difference between hobby and social topics. Specifically, the comparison between Job and Hobby had an estimate of 0.9605 (95% CI: 0.1933 to 1.7277), while Job vs Social had an estimate of 1.0961 (95% CI: 0.3081 to 1.8841).
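The pairwise estimates are on the log-odds scale, which can be checked against the reported click probabilities. The sketch below is only a consistency check on the rounded percentages, not a refit of the model; the function name is an assumption.

```python
import math


def log_odds_ratio(p_a, p_b):
    """Log odds ratio of probability p_a relative to probability p_b."""
    odds_a = p_a / (1.0 - p_a)
    odds_b = p_b / (1.0 - p_b)
    return math.log(odds_a / odds_b)


# Rounded click probabilities reported in the text.
job, hobby, social = 0.38, 0.19, 0.17

job_vs_hobby = log_odds_ratio(job, hobby)
job_vs_social = log_odds_ratio(job, social)

# Back on the odds scale, Job has roughly 2.6x the odds of Hobby.
job_vs_hobby_odds = math.exp(job_vs_hobby)
```

Both values land close to the reported estimates of 0.9605 and 1.0961, and because neither confidence interval includes 0, both differences are significant on the log-odds scale.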

TABLE II: AI Identification Confusion Matrix
         AI          Human       SUM
AI       78          72          150
Human    72          78          150
SUM      150 (52%)   150 (52%)   156/300 (52%)

5.1.3 Identifying Message Origin

To assess if subjects could correctly identify whether a message was AI-generated or human-authored (RQ3), we analyzed their guesses for both types of messages. Subjects correctly identified the message origin 52% of the time. Note that 50% would be expected from randomly guessing. Table II describes the confusion matrix that details their accuracy.
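The accuracy figure can be recomputed from the counts in Table II. The sketch below also adds an exact binomial comparison against chance, which is an illustrative check and not an analysis reported in the study; it treats all 300 guesses as independent, which is an assumption since each target contributed 12 guesses.

```python
from math import comb

# Counts from Table II. Which axis holds the true origin does not matter here,
# because the diagonal (correct guesses) is the same either way.
matrix = [[78, 72],
          [72, 78]]

correct = matrix[0][0] + matrix[1][1]
total = sum(sum(row) for row in matrix)
accuracy = correct / total


def binom_two_sided_p(k, n):
    """Exact two-sided binomial p-value against chance (p = 0.5)."""
    distance = abs(k - n / 2)
    extreme = sum(comb(n, i) for i in range(n + 1) if abs(i - n / 2) >= distance)
    return extreme / 2 ** n


p_chance = binom_two_sided_p(correct, total)
```

Under these assumptions, 156 correct guesses out of 300 is well within what random guessing would produce.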

We also used a logistic regression model incorporating predictors such as the presence of emojis (Emoji), whether there were modified links (LinkMod), and the number of characters in the message (CharacterCount). The overall p-value of 0.3253 indicated no significant ability to distinguish between AI and human-authored messages based on these features.

5.2 Content Characteristics of Persuasive Messages

This section addresses RQ2 from a qualitative perspective. Specifically, the following subsections identify the core themes that resulted from our analysis of the verbal explanations participants gave when describing why they sorted messages as most likely to deceive them or least likely to deceive them.

5.2.1 URL

The URL was mentioned by 64% of all participants as having convincing attributes (16/25). Six participants noted how the URL can provide a sense of trust, with one noting, “BYU, I kind of trust BYU and the URL here” (T41). Three participants mentioned convincing characteristics with regards to the domain of the URL. One stated, “The biggest part of this one [URL], and I even debated whether I would put it first or second; but it has an .edu link . . . and maybe I’m very ignorant about this, but I feel like that’s harder to, like, create a fake .edu website” (T16). In contrast, another participant used similar reasoning for not clicking on a link, noting, “At first, I thought, Oh, I would click on that. But then it says google.com.org; which is weird, because I’ve never seen that before” (T07). This same participant mentioned some inaccuracies related to the domain that would cause him/her to rethink clicking on it. Another participant mentioned the use of a .net website which contributed to them clicking on the link because the domain seemed trustworthy.

Two participants were more likely to click on the link if it included the HTTPS protocol. One said that “If it looks legit, yeah, like HTTPS and the /ProvoLibrary, something like that” (T41), then they would click. Another participant said, “they all say HTTPS, which makes me feel like it’s secure” (T07). Personal association with a URL may also play a role, with one participant stating they were more likely to click on the link if it included their name within the URL, saying that “the link actually has my name in it. That would have really like, definitely thrown me for a loop, like really made me like, actually think about clicking the link just because it’s like, oh, this is actually from maybe from BYU” (T20).

Ten participants were concerned about the use of a link shortener within the message. One mentioned that “I feel like anytime I’ve seen a tiny URL address it, like, either hasn’t been real, or it’s been like, weird or just different things” (T16). Six of the participants were less likely to click on a link if it was misspelled e.g., “facebock” instead of “facebook” (T13). Overall, the perception is that URL name, domain, and indicators within the domain (e.g., HTTPS, spelling) help create perceptions about the deceitful nature of the message.

5.2.2 Technology Communication Medium

The technology medium used for communication was mentioned by 40% (10/25) of the interviewees. One of the 10 mentioned the medium increasing their likelihood of clicking on the link. S/he mentioned that it is “not that abnormal for [them] to receive” text messages from their home city (T13). However, most targets saw the text medium as a red flag, even when they were inclined to believe the content. This was sometimes because of warnings from trainings they had received. One explained that “at work we were warned that all, like, valid messages will be through emails” (T25) and another said, “This one, [employer] is not going to reach out to me via text for security stuff, period” (T08). T37 mentioned that they don’t have a “company texting deal,” making them more suspicious of SMS messages. Another mentioned that “it’s very rare that people message me, you know, we talk over Slack or we talk over WhatsApp, maybe through email” (T31). Others noted that other communication tools, such as an online learning platform (T09) or a genealogy website (T02), would not have sent them a text message. Interestingly, in all these cases, the technology communications medium was associated with the source of the message. In other words, receiving a text message was not abnormal in and of itself, but receiving a text message from a particular source was deemed either appropriate or not.

5.2.3 Context Inaccuracies

The presence of inaccuracies in the messages was mentioned by 28% (7 of 25) of the targets. These inaccuracies contributed to perceptions about message credibility. Participants identified discrepancies related to the names or characteristics of individuals or entities within their professional spheres. For example, two targets remarked about the message mentioning a colleague who does not exist, stating “there’s no Mike at work” (T37) or “there’s no one named Sarah on the BYU instructional design team” (T18). Another participant expressed confusion about being questioned on matters not relevant to their professional duties, saying that they “don’t deal with payables. So why is that asking me regarding payment?” (T25). Another target noted discrepancies related to job responsibilities, asserting that such library-related decisions would “go through me at the library”, making any logistical deviation “immediately suspicious” (T01). Regarding messages related to their hobbies, one participant raised skepticism about the message claiming a “great deal for a hiking trail” which the participant knows through experience doesn’t actually require payment for use (T30). Another target was also suspicious of the message being from a library that they “just don’t use… so they would not know me to contact me” (T08). In summary, the instances where inaccuracies were mentioned ranged from misrepresented personnel and responsibilities to factual discrepancies about the targets’ affiliations, activities, and expectations.

5.2.4 Personal Relevance

The messages being personally relevant (or irrelevant) to the participants was mentioned in 76% (19 of 25) of the interviews. Seventeen targets explained that personal relevance of the message made it more convincing. Several participants pointed to messages directly tied to their occupational responsibilities, with one saying that “because that is my job is to help people with records…this is one that I feel like I’m most likely to engage with” (T13). Another target, who is a banker at a credit union, noted the similarities between the targeted message and messages s/he receives at work. S/he noted that the “alert literally looks like the alert we get [at work] when there’s a fraud” (T33). Some participants pointed out how the spear phishing messages aligned with their personal interests. One expressed enthusiasm about a message describing a “Vineyard gardening club”, believing that “somebody from our community” may have sent it. They expressed that they “would love to get involved in that” (T31). While most participants were more convinced when messages were relevant to them, one more skeptical participant said that “this interests me and it could exist, but I’m not gonna click on it” (T08).

On the other hand, 17 of the targets found messages unconvincing when they lacked personal relevance. Some participants highlighted messages that didn’t align with their current activities. For example, one interviewee noted, “I’m also not actively dancing anymore, so that’s just weird that they’re offering me something like this” (T25). Another target was suspicious of a social media-related message because they “don’t have an Instagram [account]” (T34). Some targets emphasized messages that failed to pique their interest, with one saying that they are “not into Brandon Sanderson [the author]” (T37). Another mentioned that they have already rescued an animal and “don’t need more right now” (T02). Some targets mentioned messages that contained information that they had not disclosed. One target noted that the message wasn’t “legit” because they “never put [their] studio equipment online” (T31), while another mentioned that “it’s a little weird to be selected for something that you don’t apply for” (T16). Participants also raised suspicion about messages related to unfamiliar organizations or activities, mentioning that they will not apply because “I don’t know that organization” (T41), or they “…have no idea what that would be about. So I’m just going to ignore it” (T08). Finally, targets showed skepticism when messages were somewhat relevant but lacked specificity. One target noted that “there just wasn’t enough content in the message body that was specifically directed at me” (T20). Another target highlighted a message about “recent health challenges” but noted that “there’s no indication of like, who is supposedly sending the message” (T18). Participants’ descriptions indicate that personal relevance is a very strong indicator of whether a message is perceived as convincing.

5.2.5 Scarcity Principle

Urgent wording was mentioned by 32% of participants (8 of 25). Two of the eight mentioned they were more likely to click on a link if it included urgent or fear-inciting language. One explained, “And because of ‘suspicious activity’ under my university account, then my first thing is, well, I better click on this. Because if there’s an issue with my account, I want to fix it right away” (T34). The other target said much the same, that they would click on the link because they wanted to fix the problem as soon as possible. They explained, “And just like, I’d read through it real quick, and just click on it because I’m like, oh, shoot, that’s maybe something important that’s going on. Okay” (T06).

The remaining participants (6 of 25) said they were less likely to click on the link if it included tactics such as fear and urgency. These participants explained that time-sensitive or urgent messages feel like a warning sign. “That’s always a way of like, okay, they’re trying to get you to shut off your logical brain. Put yourself into ‘Oh, my gosh, we have to do this right now.’ Yeah. Right.” Another participant confirmed, “If it seems urgent to you, you’re not going to click it because it’s very dangerous” (T13). Similarly, T41 explained, “I don’t like anything where it’s like hurry, offer and send because that always makes me wary” (T07). Similar to their concerns with time-sensitive messages, participants also discussed how fear-inciting messages can cause them to logically analyze the message at a deeper level. One explained, “To the message itself, I was a little bit unsure about just like, the fact that it was so time sensitive, and that it was like, kind of threatening to lock me out of all of my educational accounts if I didn’t do what it asked for. Because that’s not typically what I’ve seen be the case” (T16). And another stated, “Yeah. And it’s the idea of a new virus for young dogs. I’m going that’s fear mongering” (T01). The scarcity principle has long been used in literature to deceive readers. For the spear phishing messages in this study, participants had mixed reactions. Some felt the intended pressure to click, while others saw the language as manipulative and as a warning sign marking potential danger.

5.2.6 Messaging Style

Ten respondents (40%) mentioned issues related to message styling, such as text formatting, tone, and structure. Four participants acknowledged that style played a role in enhancing the credibility of a message. For example, T09 affirmed that a message was “more convincing” because the formatting and tone felt “kind of personal.” T31 also emphasized formatting in relation to message personalization explaining, “First, they talk about how they found me. They loved my work. And they saw my music profile. They talk about a music festival…. The format seems like it’s from an organized festival.” A message’s “casual style” was also suggested by respondents as a way for the message to feel authentic. A message’s “less formal” style was more “enticing” and “attractive” to T41. On the contrary, two participants mentioned that the message tone was off-putting. For example, T09 identified a message as potentially a phishing attack due to its “sales-y” tone. Additionally, T18 mentioned how sales-oriented “buzzwords” indicated a message “feels more phishy.”

About a quarter of respondents (6 of 25) discussed the role of emojis in the context of message style. All six asserted that emojis diminished the credibility of the message. T34 expressed that “I think a lot of emojis is kind of something that usually is a red flag for me.” Multiple respondents conveyed that they did not anticipate receiving emoji-filled messages from senders claiming to represent professional organizations. For instance, T27 said that in the case of a legitimate organization such as a city biking club, emojis would be “out of place.” Similarly, T20 explained that if a message claimed to be from their university, the presence of emojis would cause them to feel “thrown off.” From the perspective of participants, the use of emojis in messages, particularly those seemingly sent by professional groups, caused such messages to lose legitimacy.

5.2.7 Plausible Rewards

Realistic rewards significantly influenced the perceived credibility of phishing messages, as noted by 28% of the participants (7 out of 25). One participant said they would be more likely to click on a link, “because the reward is connected to something I put so much time and effort into” (T31), while another participant stressed the persuasiveness of plausible or realistic offers, mentioning, “It’s like taking you to a link to look at an offer, but it’s not a crazy insane offer” (T16). Some targets expressed enthusiasm for realistic opportunities, stating, “It’s something I’ll be really excited about, but also something that I think is realistic” (T34). Conversely, 24% of participants (6 out of 25) identified messages with rewards that seemed “too good to be true” as less convincing. One participant mentioned it’s “probably not likely” to be the “developer of the month four times a year” (T37). One of the targets questioned the credibility of messages offering free access to goods and services, stating, “It sounds too good to be true” (T34). Another target warned against messages promoting anything for free, noting, “Anything that’s free is a little too good to be true. Okay, so that’s when I would be very careful” (T27). This collective sentiment emphasized the convincing nature of realistic rewards in phishing messages while avoiding extravagant or implausible claims that trigger suspicion.

5.2.8 Sender Familiarity

Participants described how the perceived credibility of phishing messages is tied to the sender’s identity, as noted by 7 out of 25 targets (28%). One participant said the message seemed more convincing because the sender “introduced herself as [their] neighbor” (T30). Another explained the importance of a personal introduction, stating, “The sender started with ’My name is John’” (T41), which helped the target to trust the message. Several participants expressed the importance of connecting with the senders of the messages. One target was convinced by the “actual company branding” that mimicked promotional texts they had previously received (T18). Another explained, “This could be someone from my ward [church congregation], wanting me to like check out some plants or something” (T31). Participants explained that when the message sender appears to be a legitimate, familiar source, messages seemed more credible.

Turning to participants’ suspicions, 68% (17 out of 25) highlighted the role of the sender in raising skepticism about received messages. One participant explained that messages appear suspicious when the introduction is not consistent with other messages from that same sender, saying, “James is my boss… it’d be weird that he was introducing himself that way… that doesn’t sound like my boss” (T31). Several participants described the need to understand how the sender unexpectedly obtained their phone number and the purpose for which the sender sent a message. One target noted, “So like, how did you get my number in the first place?” (T27), while another questioned, “Like why would Lowe’s have my phone number?” (T20). Participants described how recognition of the sender’s phone number is also an important component of message believability. One participant explained, “If it’s from a number I don’t recognize at all. Then yeah, probably the first thing I would ask is who it is. And then however, if the person is like, if their name is like the same as one of my friends, then I probably would open it” (T34). Participants described that messages lacking personal connections from the sender raised suspicions, with one stating, “I don’t feel like it’s anyone who’s personally connected with me; it just seems like spam, you know?” (T31). Being familiar with the message sender was described as a strong indicator of whether subjects believed the messages to be legitimate. It is important to note that none of the study text messages sent to participants indicated a source phone number. Rather, our focus was on the content of the messages.

5.3 Human/AI Source Identification

This subsection addresses RQ3, which asks how effective targets are at identifying AI-generated spear phishing SMS messages. The statistical analysis presented earlier showed that targets could not effectively identify AI-generated messages. The qualitative results presented here reinforce the difficulty of this task for targets. Although the targets described some of their reasoning behind their guesses on which messages are AI-generated (see the following section), 12 out of 25 of the participants stated in some way that they were uncertain about their decisions, often relying on intuition rather than any specific criteria. One target remarked, “sometimes it’s a gut feeling maybe more than like a specific thing you’re looking for?” (T07), while another said it’s “just a feeling” (T02). Others attributed their lack of criteria to advancements in AI, saying that they “don’t really, really know my criteria, because I know that AI is getting so good” (T04). Several targets openly admitted their lack of expertise, saying they “have no idea” (T15), and “actually I don’t know, I don’t know” (T23). Some were unsure if AI could personalize content or include icons in text (T02), while others conveyed concerns about AI’s competency, hoping that “the ones that are worse would hopefully be the AI ones” (T34). Overall, these responses show the difficulty the targets faced in distinguishing between AI-generated and human-generated content. In general, they did not seem to have mental models that provided meaningful guidance in identifying AI-generated messages.

5.4 AI Identification Criteria

This subsection addresses RQ4 by identifying the criteria people use to try to identify AI-generated messages.

5.4.1 Emojis

The use of emojis in messages was mentioned by 20% (5/25) of the targets as being indicative of an AI-generated message. Two concluded that the use of emojis makes a message appear to be written by AI (T25), with one target explaining, “there’s always an emoji” and that the emoji messages don’t look very personal (T30), indicating a machine wrote the message. Alternatively, three of the targets connected emoji use to human writing because they were not sure if AI could use emojis. One target claimed they didn’t think a message was AI written because of “the icon” (T02), while another explained “I’ve never asked an AI to do something with emojis. I wasn’t sure if it could. I’ve never seen that, but I wouldn’t expect emojis” (T18). The targets’ opinions on whether the message was human or AI-written seemed influenced by their awareness of AI’s capability to use emojis and their previous encounters with AI using emojis. In our spear phishing messages, emojis appeared in 66% of messages written by AI and only 2% of messages written by humans.

5.4.2 URL

The message URL was mentioned by 8% (2/25) of the targets. Both of these targets explained that the included URL led them to believe the message was AI-generated. One target mentioned that the word “dot” was spelled out rather than typed so it “seemed like something maybe an AI would do” (T13). The other target concluded that altered links, such as bitly, are evidence of AI and stated that AI “changed the URL to be something a little bit easier to read and understand” because they have used ChatGPT before and remembered an output with a shortened link. Both targets treated a suspicious URL component as evidence that the message was AI-generated. Most (71%) human authors did not include an actual URL; instead, they included placeholders such as “[URL],” “site,” or “url.” In contrast, only one AI message had a placeholder. As discussed in the methods, all of these placeholders were replaced with a shortened link from tinyurl.

5.4.3 Word Choice

The message word choice was mentioned by 24% (6/25) of the targets. Four of the targets claimed that the word choice sounded like AI, particularly the use of buzzwords and marketing words. One target explained that messages sound AI-generated when they use buzzwords that are associated with being “attention grabbers” because “AI would utilize those a lot” (T16). Another target described a message as “too specific” when it included words that humans “wouldn’t necessarily use” when describing instructional design (T18). A few of the targets agreed that some of the messages had sentences that a human would have worded differently (T09). Conversely, two targets gave evidence of human-sounding messages and both mentioned slang words. One target described “There’s so many slang terms on it, that it seems really human to me” (T34), while another defended the messages they thought to be human-created and explained, “oftentimes, it was just because I felt like the language seemed a little more slang-like” (T06). Messages with specific, marketing-like words were perceived as AI-written while messages with casual, slang words that sounded more conversational were perceived as human-written.

5.4.4 Style

The message style was mentioned by 40% (10/25) of the targets. Many targets noted that AI-generated messages tend to be “pretty formal” (T21) or “overly informal or overly formal” (T09), while others said messages often appear “extremely generic sounding,” as if they were created from a prompt with specific parameters (T18). Additionally, the presence of “glaring flaws” such as typos and awkward phrasings was attributed to human error. Such common mistakes contrasted with the “more robotic” nature of AI messages (T08). Some targets observed that AI-generated messages might seem “too perfect,” for example by refraining from casual texting language that humans often use, such as “RN” for “right now” (T34). One participant noted that the excessive use of exclamation points is what made them think the message was AI-generated: “I mean, honestly, what’s driving me insane about all of these is exclamation points. Yeah. I’m like, why are there so many exclamation points all over?” (T07). In contrast, messages that gave a sense of urgency were more likely to be perceived as human-written, as one target explained: “Then I didn’t put this one because they were actually pressuring me to do it immediately. So I don’t think AI can do it” (T02). Overall, targets indicated that a message’s style played a role in their AI vs. human decisions, with AI messages being associated with a style that is overly formal, generic, or laden with exclamation points, and human messages appearing more casual, urgent, and imperfect.

5.4.5 Length of Message

The length of the message was mentioned by 16% (4/25) of the targets. Participants assumed messages made by AI would be longer than human-made messages. One mentioned that they would have “AI do the longer ones and I would write the shorter ones myself” (T21), assuming those who created these messages would do the same. These four targets were correct in their hunch that AI messages would be longer. Across the messages, AI-generated messages have on average 41% higher character count per message, with 337.8 characters for AI, and 237.9 characters for human-authored messages.

5.4.6 Content/Structure

The structure of the message content was mentioned by 24% (6/25) of the targets. Participants pointed out various aspects of structure that hinted at AI involvement, including repetitive phrases and/or awkward sentence construction. For instance, one participant thought a message was AI-generated because “it looks like there’s a template. So there’s like a flow that you know, there’s a pattern you see it as maybe started by a human” (T41). Another noted instances where messages seemed generic or mass-produced, akin to “the email that is sent to everyone in the school” (T41). One also pointed out inconsistencies within messages, such as repetitive phrases or “two different things, in the same message” (T04). One target also noted that AI-generated messages tended to be more wordy, containing “lots of filler phrases” compared to human-authored messages (T37).

5.4.7 Grammar/Spelling

The grammar and spelling of the messages was mentioned by 24% (6/25) of the targets. Four of the targets explained that the grammar and spelling was “too perfect” (T34), which led them to believe the message was AI-generated. One described that “the grammar was almost too correct” (T06), while another described the messages as “pretty” (T34). Messages that were written “not how you would actually speak” were considered to be AI-generated (T13). Three of the targets gave evidence of spelling and grammar that seemed human-like. One target said “this one’s a legitimate person right there, because they spelled the ‘util source’ wrong” (T37), while another gave a similar reasoning and stated, “I don’t think that AI would have made the grammar mistake” (T21). If the messages included a “mix of good grammar with bad or some texting language” (T34), a human author became more believable. While the participants who noted grammar and spelling were split between AI and human-based evidence, most came to the same conclusion that AI would have perfect grammar and spelling while humans would have made errors.

5.4.8 Personalization

The message personalization was mentioned by 32% (8/25) of the targets. Five participants mentioned that a lack of personalization made them suspect the message was AI-generated. For instance, one target noted, “AI were the ones that were a little bit less personal” (T34), while another expressed doubt, saying, “There is no connection with me as a reader” (T41). Some targets believed the absence of personal pronouns or a personal introduction with the recipient’s name indicated AI authorship. One target observed, “All of these messages, but two, include my name” (T37), leading them to suspect those two were AI-generated. Conversely, four participants cited personalized messages as evidence of human authorship, providing similar reasons as those who suspected AI. This group of participants felt messages tailored to the recipient sounded human, with one target commenting, “it just sounds very personal” (T21). Additionally, recipients noted that human-sounding messages often began with an introduction of the sender, such as “this is Sarah from the BYU student instructor program” (T41). Overall, personalized messages were deemed more human-like, while those lacking personalization were associated with AI.

6 Discussion

Our discussion section is organized around our key research questions, followed by a reflection on the pros and cons of the TRAPD methodology.

6.1 AI vs. Human in Creating Persuasive Spear Phishing Messages

A growing body of literature has found evidence that AI-generated content from recent LLMs can outperform human-created content in various domains [8, 19, 20, 21]. For example, Zhang and Gosline compared content generated by GPT-4 with similar content created by professional content creators in the advertising field, finding that “Content generated by generative AI and augmented AI is perceived as of higher quality than that produced by human experts and augmented human experts” [19]. Nisbett and Spaiser explored the use of GPT-3 in creating moral statements supporting climate action, concluding that “GPT-3-generated statements are generally more convincing than human-generated statements” [21]. Similarly, in the domain of political speech, Palmer and Spirling found that “LLMs are capable of producing human-style arguments for different positions on subjects as varied as abortion, guns, immigration, and organ donation” and they could “out-perform human authors, though it varies by topic” [20].

Fewer studies have examined the ability of LLMs to create content tailored to individuals, as we do in this paper. Heiding et al. examined phishing messages created by humans (using the V-triad approach), GPT-4, both combined, and a control message from an existing phishing dataset [22]. They found that the human V-triad and the combined human V-triad and GPT-4 approaches led to the highest click-through rate in a field experiment, followed by GPT-4 and then the control group [22]. Although they did not personalize these messages to each individual, they did personalize them to a specific university context. They also provide a cost analysis that demonstrates how cheap the creation of spear phishing messages can be. Hazell also points out that LLM-generated spear phishing messages can be “realistic” and “cost effective,” but does not provide systematic evidence comparing humans to AI, such as presented in this paper [8].

Our study contributes to this body of literature by comparing the effectiveness of AI-created and human-created spear phishing messages tailored to an individual target. We found that AI generally outperformed humans in creating spear phishing messages; however, the difference was not statistically significant, in part due to the relatively low sample size. As presented in Section 5.1.1, there is an 80% probability that GPT-4 is at least as good as, or better than, humans at creating these highly personalized spear phishing messages. While further research is justified in this area, we believe our findings are a clear indication of the capabilities of LLMs in generating plausible spear phishing messages. It is important to note that these results are based on a very simple prompt, which could likely be improved. Furthermore, LLMs continue to improve in their quality, suggesting that AI will likely perform better at this task in the future. However, despite the fact that our human spear phishing authors had been trained on techniques to create spear phishing, it is possible that they could have been trained better. For example, they were not trained using the V-triad approach used so successfully by Heiding et al. [22].

6.2 Characteristics of Convincing Spear Phishing Messages

Our statistical analysis found that targets considered job-related spear phishing messages about twice as persuasive as messages related to hobbies or social media posts (38% intended click rate compared to 19% for hobby and 17% for social media). This was true for both human and AI-generated messages. This suggests the need to be particularly vigilant for work-related spear phishing attacks.

The qualitative findings from this study point to several phishing message features that make some messages more convincing than others [9, 12]. Our findings aligned with prior research, revealing that messages matching the receiver’s expectations in terms of the sender, context, and relevance of a message were more convincing; those with poor grammar and spelling were less believable; and messages relating to the recipient’s life were more persuasive.

Though we found general alignment across participants in terms of the features that matter, we also found that in some cases, the features that cause one person to “avoid phishing emails makes another person fall for them,” a finding that was recently reported [22]. This suggests that some features may work better for certain individuals than others, a level of personalization that we have not yet examined but that seems possible to implement in future AI-generated spear phishing attacks that have more information about targets’ preferences.

A key focus of this paper was on personalization of messages, since these were spear phishing messages tailored to individuals. Personalizing messages can lead to more persuasive messages [23, 10, 6], but can also raise red flags when the personalization is even slightly off. In this study, over two thirds of participants were likely to believe messages that related to topics of personal interest and relevance, showing the power of personalized phishing messages when done well. However, two thirds of participants also shared examples of messages that were not believed because they included content that was not personally relevant or included red flags such as organizations they had never heard of. Getting personalization right showed up in other categories as well. For example, many targets thought the technology communication medium (texting) was not appropriate for the type of message they received. Others recognized context inaccuracies, such as a colleague’s name that is made up. And while some people were more convinced by messages that had senders that were familiar, over two-thirds were turned off by inclusion of senders that were suspicious and not someone the person would know. All of these suggest the need for further research not only on what makes phishing messages persuasive, but how that interacts with personalization in spear phishing. We believe future studies can build upon the categories and insights related to spear phishing messages identified in this paper.

6.3 AI vs. Human Message Identification

This study illustrated the limitations of humans in trying to identify AI-generated spear phishing messages. Our quantitative results show that targets guessed accurately only 52% of the time, where 50% would be expected from randomly guessing. Qualitative results suggest that many individuals had no idea how to even approach this task, lacking mental models to know what criteria to use to differentiate them. This finding aligns with previous studies, such as those comparing AI-generated and human-written poetry, which demonstrate that people often struggle to differentiate between AI and human authorship [24]. Our results highlight a need for improved literacy and education for individuals against AI-driven phishing threats. One approach is to use LLMs to provide recommendations on how to deal with suspicious messages [22]. Another is to help users realize that they may expect to get more suspicious personalized messages in the future, given the lowering cost of creating spear phishing messages using AI. In the end, differentiating between AI-generated and human-generated messages is not as important as avoiding clicking on suspicious links.
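To make concrete why 52% accuracy is hard to distinguish from random guessing, the following is a minimal sketch of an exact two-sided binomial test. It is not the analysis used in this study. The counts are hypothetical: we assume, purely for illustration, 300 guesses (25 targets × 12 messages), of which 156 (52%) were correct, and we treat the guesses as independent even though guesses from the same target are likely correlated.

```python
from math import comb


def binom_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value.

    Sums the probabilities of every outcome that is no more likely
    than the observed count k under the null success probability p.
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    # A small relative tolerance guards against floating-point ties.
    total = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical counts, assumed for illustration only (not reported in the paper).
N_GUESSES = 300   # assumed: 25 targets x 12 messages
N_CORRECT = 156   # assumed: 52% of 300

p_value = binom_two_sided_p(N_CORRECT, N_GUESSES)
print(f"accuracy = {N_CORRECT / N_GUESSES:.2f}, two-sided p = {p_value:.3f}")
```

Under these assumptions the p-value is far above conventional significance thresholds, which is consistent with the interpretation that targets performed at roughly chance level.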

Despite their lack of success, targets explained the criteria they used when guessing which messages were AI-generated. Their lack of success and confidence in their guesses suggests that these criteria were superficial, heuristic cues for assessing the source of messages rather than hard and fast rules. Our results were consistent with previous research about human heuristics for AI-generated language [25]. Still, it is useful to better understand the perceived limitations that people put on AI. They described a variety of criteria including specific linguistic features (the presence of emojis, how the URLs look, word choice), as well as stylistic choices (length of message, content structure, grammar, and spelling). They commonly noted that AI-generated messages frequently exhibit peculiar word choices, overly formal or inconsistent styles, and unnatural content structures. Moreover, AI-generated texts were often perceived as either too perfect or containing subtle but noticeable grammatical and spelling errors. Also, messages that lacked specific personal details or felt generic were more likely to be identified as AI-generated, since people assumed that AI could not create highly personal content. This is particularly problematic, since this study suggests that, in fact, AI is likely better at creating highly personalized phishing messages than humans.

6.4 The TRAPD Methodology

The TRAPD methodology has several advantages. First, it provides an ethical way for personalized deceptive messages to be evaluated, since it starts with targets giving consent to share their own personal information in ways that can be “weaponized,” while minimizing the effects of the deceptive messages. Second, the method provides data in a format that supports statistical tests of differences in the effectiveness of different types of deceptive messages (e.g., AI vs. human, topic of message). Third, it allows targets to provide qualitative insights about their decision-making when ranking messages. It is difficult to collect such data with other methods (e.g., field experiments where targets receive a single message in an authentic environment). Furthermore, grounding the discussion in the actual messages themselves (e.g., “why did you place this message in the most likely to click on position?”) increases the accuracy of targets' self-reported reasoning. Having them do this for multiple messages in one sitting also helps them recognize patterns. Finally, this method can be a positive learning experience for the targets, who learn about their own thinking and what is most and least likely to deceive them.
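As an illustration of the kind of data a TRAPD sorting session yields, the sketch below records one target's rank-ordered messages together with a click threshold and summarizes them by message source. All names (`RankedMessage`, `summarize`, `click_threshold`) and the six-message ranking are hypothetical and shown only for illustration; the study itself used 12 messages per target, and this is not the authors' analysis code.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RankedMessage:
    rank: int    # 1 = most convincing
    source: str  # "ai" or "human"
    topic: str   # "job", "hobby", or "purchase"


def summarize(ranking, click_threshold):
    """Return the mean rank and would-click count for each message source.

    Messages with rank <= click_threshold are the ones the target said
    they would click on.
    """
    summary = {}
    for source in sorted({m.source for m in ranking}):
        ranks = [m.rank for m in ranking if m.source == source]
        summary[source] = {
            "mean_rank": mean(ranks),
            "would_click": sum(r <= click_threshold for r in ranks),
        }
    return summary


# Hypothetical ranking for a single target (not data from the study).
ranking = [
    RankedMessage(1, "ai", "job"),
    RankedMessage(2, "human", "job"),
    RankedMessage(3, "ai", "hobby"),
    RankedMessage(4, "ai", "purchase"),
    RankedMessage(5, "human", "hobby"),
    RankedMessage(6, "human", "purchase"),
]
result = summarize(ranking, click_threshold=2)
print(result)
```

Keeping the rank, source, and topic on each record means the same structure can be regrouped by topic instead of source, and records from many targets can be pooled for a statistical comparison.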

As with all methods, there are also drawbacks. These drawbacks primarily come from the fact that the targets provide self-reported data in a lab environment. For example, it is possible that participants say they would click on a link in a phishing message when in fact they would not, or vice versa. Targets may have a social desirability bias, which would lead them to be less likely to admit falling for a deceptive message in the lab environment. Although it is hard to know how strong this tendency may be, the majority of our participants (80%) said they would fall for at least one message. Furthermore, those who said they are confident in their ability to identify phishing messages said they would fall for fewer of them. It is also possible that individuals would act differently in more authentic contexts than they expect when reporting in a lab. For example, a target may be less discerning when trying to quickly reply to a text message while riding a crowded bus than if they were at home on the couch. In our particular study, we printed the text messages on the outline of a mobile phone to try to trigger thought patterns similar to those targets might have when looking at a real phone, but we could not change the lab environment or the fact that the messages were on paper rather than on a screen. Additionally, this method focuses on the “content” of the messages, but cannot give input on some contextual factors, such as the source of the message. In the end, we have stronger confidence in the TRAPD method to identify differences between deceptive messages (and types of messages) than we do in the actual percentage of messages targets self-report as being willing to click on. Our observations of sessions with targets confirmed that they were fully engaged and took the sorting and threshold identification tasks very seriously.

On a practical level, we learned several things about effectively implementing the TRAPD methodology that we hope will be useful to others in the future. We had originally planned on having targets sort 16 items, but after pilot testing, decided to lower the number to 12. Our observations suggest that it took participants about 10-15 minutes to sort 12. We would not recommend going above that number. As expected, not all targets who started the study came back for their interview. We ended up with a 61% return rate (25/41) after multiple requests for targets to return for the sorting interview. We recommend that future studies plan for similar or possibly lower return rates. It is critical to make clear to targets that they are consenting to come back for an interview (if at all possible), not just to complete an initial survey. In this particular study, the final target count of 25 was low enough that we did not get statistically significant results in some areas where we likely would have with a larger sample. Because we had already solicited the human-generated messages targeted to the original targets, it is not practical to increase the sample without redoing essentially the entire study. This underscores the importance of obtaining a high return rate when using human-created messages. One other limitation of the current study, which can inform future studies, was the need to handle URLs consistently. We found that GPT-4 generally created made-up links within messages, while many humans just included text such as “[URL]” or “link” instead of an actual link. We used a tinyurl.com link in such cases. Future studies may consider either using the same link within all messages, or requiring humans to create full URLs.
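The return-rate advice above reduces to simple arithmetic, sketched here for planning purposes. The observed rate of 25/41 comes from this study; the target of 50 completed interviews and the more pessimistic 50% rate are hypothetical planning inputs, and `recruits_needed` is an illustrative helper rather than part of the TRAPD method.

```python
from fractions import Fraction
from math import ceil


def recruits_needed(target_completions: int, return_rate: Fraction) -> int:
    """Smallest number of initial recruits expected to yield the target
    number of completed sorting interviews at the given return rate."""
    if not 0 < return_rate <= 1:
        raise ValueError("return_rate must be in (0, 1]")
    return ceil(target_completions / return_rate)


# Return rate observed in this study: 25 of 41 targets came back.
observed_rate = Fraction(25, 41)

# Hypothetical goal of 50 completed interviews.
planned = recruits_needed(50, observed_rate)
# Same goal under a more pessimistic, assumed 50% return rate.
conservative = recruits_needed(50, Fraction(1, 2))
print(planned, conservative)
```

Exact fractions are used instead of floats so that the ceiling is not thrown off by rounding error. Because human-authored messages must be written before targets return, planning with the lower rate gives some margin against ending up with an underpowered sample.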

The TRAPD methodology can be used in a variety of contexts where personalized deceptive messages are created and evaluated. Indeed, an early study evaluating the effectiveness of a spear phishing training approach versus a control group used a less formally documented version of the TRAPD method [18]. We anticipate future studies using TRAPD to examine personalized disinformation, interactive AI-generated vishing attacks, and other forms of spear phishing. While using human-generated messages as part of TRAPD is clearly possible, there are advantages to using only automated messages. For example, imagine running a study that compares different LLMs or different prompts. The study could recruit targets, generate personalized deceptive content, and have targets rank-order the messages and describe their reasoning, all within the same online survey. This is not possible with human-generated content because of the delay needed to create the content, but it would be possible with content generated only by AI. An online tool that supports TRAPD would likely help scale up participant numbers.

7 Conclusion

Deceptive messages that are personalized to a particular individual, such as spear phishing messages, can be highly persuasive. We are entering a new age in which LLMs can be used to generate personalized deceptive messages. This study uses a novel methodology we call TRAPD to compare the efficacy of deceptive spear phishing SMS messages created by humans and AI. Although our statistical results are not definitive (due to a sample size of 25), we find a high likelihood (80%) that AI outperforms humans. We also find that messages related to jobs outperform those related to hobbies and items purchased online. Participants who were targeted with spear phishing messages described reasons they classified personalized smishing messages as particularly deceptive or easy to identify as fake. These were classified into the following key categories, in order of importance: characteristics of the URL, proper use of the technology communications medium of texting, use of inaccurate information, degree of personal relevance, application of the scarcity principle, messaging style, use of plausible rewards, and characteristics of the stated sender. We also report on the failure of targets to identify which messages were created by AI versus humans. Targets were not confident in their ability to identify AI-generated content. They were often inconsistent in their assessment of the features that they thought indicated a message was AI-generated. Features that influenced their AI vs. human message decisions included the message having emojis, characteristics of the message URL, word choice, message style, length of messages, structure of messages, grammar/spelling, and if and how personalization was used. In summary, current LLMs can create highly deceptive spear phishing messages personalized to a target without targets having any idea that they are created by AI. This poses significant risks of cybersecurity breaches and threatens societal resilience.

References

  • [1] R. A. Alsharida, B. A. S. Al-rimy, M. Al-Emran, and A. Zainal, “A systematic review of multi perspectives on human cybersecurity behavior,” Technology in Society, vol. 73, p. 102258, May 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0160791X23000635
  • [2] E. O. Yeboah-Boateng and P. M. Amanor, “Phishing, SMiShing & Vishing: An Assessment of Threats against Mobile Devices,” vol. 5, no. 4, 2014.
  • [3] Federal Bureau of Investigation, “FBI Internet Crime Report 2023.” [Online]. Available: https://www.ic3.gov/Media/PDF/AnnualReport/2023_IC3Report.pdf
  • [4] Symantec, “Internet security tech report.” [Online]. Available: https://docs.broadcom.com/doc/istr-24-2019-en
  • [5] “2024 State of the Phish Report: Phishing Statistics & Trends | Proofpoint US,” Feb. 2024. [Online]. Available: https://www.proofpoint.com/us/resources/threat-reports/state-of-phish
  • [6] Z. Benenson, F. Gassmann, and R. Landwirth, “Unpacking Spear Phishing Susceptibility,” in Financial Cryptography and Data Security, M. Brenner, K. Rohloff, J. Bonneau, A. Miller, P. Y. Ryan, V. Teague, A. Bracciali, M. Sala, F. Pintore, and M. Jakobsson, Eds.   Cham: Springer International Publishing, 2017, pp. 610–627.
  • [7] P. Rajivan and C. Gonzalez, “Creative Persuasion: A Study on Adversarial Behaviors and Strategies in Phishing Attacks,” Frontiers in Psychology, vol. 9, Feb. 2018, publisher: Frontiers.
  • [8] J. Hazell, “Large Language Models Can Be Used To Effectively Scale Spear Phishing Campaigns,” May 2023, arXiv:2305.06972 [cs]. [Online]. Available: http://arxiv.org/abs/2305.06972
  • [9] D. S. Oliveira, T. Lin, H. Rocha, D. Ellis, S. Dommaraju, H. Yang, D. Weir, S. Marin, and N. C. Ebner, “Empirical analysis of weapons of influence, life domains, and demographic-targeting in modern spam: an age-comparative perspective,” Crime Science, vol. 8, no. 1, p. 3, Dec. 2019. [Online]. Available: https://crimesciencejournal.biomedcentral.com/articles/10.1186/s40163-019-0098-8
  • [10] T. Lin, D. E. Capecci, D. M. Ellis, H. A. Rocha, S. Dommaraju, D. S. Oliveira, and N. C. Ebner, “Susceptibility to Spear-Phishing Emails: Effects of Internet User Demographics and Email Content,” ACM Transactions on Computer-Human Interaction, vol. 26, no. 5, pp. 32:1–32:28, Jul. 2019. [Online]. Available: https://doi.org/10.1145/3336141
  • [11] J. Seymour and P. Tully, “Generative Models for Spear Phishing Posts on Social Media,” Feb. 2018, arXiv:1802.05196 [cs, stat]. [Online]. Available: http://arxiv.org/abs/1802.05196
  • [12] R. Karamagi, “A Review of Factors Affecting the Effectiveness of Phishing,” Computer and Information Science, vol. 15, no. 1, p. 20, Nov. 2021. [Online]. Available: https://ccsenet.org/journal/index.php/cis/article/view/0/46414
  • [13] S. Mishra and D. Soni, “SMS Phishing and Mitigation Approaches,” in 2019 Twelfth International Conference on Contemporary Computing (IC3), Aug. 2019, pp. 1–5, ISSN: 2572-6129.
  • [14] M. Bossetta, “The Weaponization of Social Media: Spear Phishing and Cyberattacks on Democracy,” Journal of International Affairs, vol. 71, no. 1.5, pp. 97–106, 2018, publisher: Journal of International Affairs Editorial Board. [Online]. Available: https://www.jstor.org/stable/26508123
  • [15] H. Khan, M. Alam, S. Al-Kuwari, and Y. Faheem, “Offensive AI: Unification of Email Generation through GPT-2 Model with a Game-Theoretic Approach for Spear-Phishing Attacks,” pp. 178–184, Jan. 2021, publisher: IET Digital Library. [Online]. Available: https://digital-library.theiet.org/content/conferences/10.1049/icp.2021.2422
  • [16] M. Mozes, X. He, B. Kleinberg, and L. D. Griffin, “Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities,” Aug. 2023, arXiv:2308.12833 [cs]. [Online]. Available: http://arxiv.org/abs/2308.12833
  • [17] P. Wilczyński, W. Mieleszczenko-Kowszewicz, and P. Biecek, “Resistance Against Manipulative AI: key factors and possible actions,” Apr. 2024, arXiv:2404.14230 [cs]. [Online]. Available: http://arxiv.org/abs/2404.14230
  • [18] J. J. Meyers, D. L. Hansen, J. S. Giboney, and D. C. Rowe, “Training Future Cybersecurity Professionals in Spear Phishing using SiEVE,” in Proceedings of the 19th Annual SIG Conference on Information Technology Education, ser. SIGITE ’18.   New York, NY, USA: Association for Computing Machinery, Sep. 2018, pp. 135–140. [Online]. Available: https://doi.org/10.1145/3241815.3241871
  • [19] Y. Zhang and R. Gosline, “Human favoritism, not AI aversion: People’s perceptions (and bias) toward generative AI, human experts, and human–GAI collaboration in persuasive content generation,” Judgment and Decision Making, vol. 18, p. e41, Jan. 2023. [Online]. Available: https://www.cambridge.org/core/journals/judgment-and-decision-making/article/human-favoritism-not-ai-aversion-peoples-perceptions-and-bias-toward-generative-ai-human-experts-and-humangai-collaboration-in-persuasive-content-generation/419C4BD9CE82673EAF1D8F6C350C4FA8
  • [20] A. Palmer and A. Spirling, “Large Language Models Can Argue in Convincing Ways About Politics, But Humans Dislike AI Authors: implications for Governance,” Political Science, vol. 75, no. 3, pp. 281–291, Sep. 2023, publisher: Routledge. [Online]. Available: https://www.tandfonline.com/doi/full/10.1080/00323187.2024.2335471
  • [21] N. Nisbett and V. Spaiser, “How convincing are AI-generated moral arguments for climate action?” Frontiers in Climate, vol. 5, no. 1193350, Jun. 2023, number: 1193350 Publisher: Frontiers Media. [Online]. Available: https://eprints.whiterose.ac.uk/201347/
  • [22] F. Heiding, B. Schneier, A. Vishwanath, J. Bernstein, and P. S. Park, “Devising and Detecting Phishing: Large Language Models vs. Smaller Human Models,” Nov. 2023, arXiv:2308.12287 [cs]. [Online]. Available: http://arxiv.org/abs/2308.12287
  • [23] P. B. Baltes, “Theoretical propositions of life-span developmental psychology: On the dynamics between growth and decline,” Developmental Psychology, vol. 23, no. 5, pp. 611–626, 1987, place: US Publisher: American Psychological Association.
  • [24] N. Köbis and L. Mossink, “Artificial Intelligence versus Maya Angelou: Experimental evidence that people cannot differentiate AI-generated from human-written poetry,” Sep. 2020, arXiv:2005.09980 [cs, econ, q-fin]. [Online]. Available: http://arxiv.org/abs/2005.09980
  • [25] M. Jakesch, J. Hancock, and M. Naaman, “Human heuristics for AI-generated language are flawed,” Proceedings of the National Academy of Sciences, vol. 120, no. 11, p. e2208839120, Mar. 2023, arXiv:2206.07271 [cs]. [Online]. Available: http://arxiv.org/abs/2206.07271