We evaluate TTPXHunter using several performance metrics and compare it against recent studies. Because our dataset is imbalanced, as shown in Appendix Figure A2, we report macro-averaged precision, recall, and f1-score [19], which ensure a balanced evaluation across all classes [12]. By giving equal weight to every TTP class, these metrics prevent the dominance of majority classes from overshadowing performance on minority classes, which promotes effective and fair models across TTP classes and is essential for nuanced TTP classification.
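As an illustrative sketch (not our exact evaluation code), macro-averaged metrics for sentence-level multi-class predictions can be computed with sklearn as follows; the `y_true` and `y_pred` lists are hypothetical:

```python
from sklearn.metrics import precision_recall_fscore_support

# Per-sentence TTP class labels (ATT&CK technique IDs); values are illustrative.
y_true = ["T1059", "T1003", "T1059", "T1566"]
y_pred = ["T1059", "T1059", "T1059", "T1566"]

# average="macro" gives every TTP class equal weight, so minority classes
# count as much as frequent ones in the reported scores.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"macro precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```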
5.2.1 Augmented Sentence-based Evaluation.
We fine-tune TTPXHunter on the prepared augmented dataset and compare it against the two BERT-based models present in the literature, i.e., TRAM [17] and TTPHunter [22]. We split the augmented dataset into train and test sets with an \(80:20\) ratio.
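As a minimal sketch of such a split (the file and column names are hypothetical, and stratifying on the TTP label is our assumption for keeping class proportions comparable across splits):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical columns: "sentence" (text) and "ttp" (ATT&CK technique ID).
df = pd.read_csv("augmented_dataset.csv")

# 80:20 train/test split; stratify keeps per-class proportions similar
# in both splits (requires at least two samples per TTP class).
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["ttp"]
)
print(len(train_df), len(test_df))
```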
TTPXHunter vs TRAM. We fine-tune TTPXHunter on the train set and evaluate it using the chosen performance metrics. We then fine-tune TRAM [17] on the same dataset and evaluate it in the same way. Employing TRAM on the augmented dataset extends its capability from the 50 most frequently used TTPs to the full spectrum of TTPs, providing common ground for comparing TTPXHunter and TRAM. The results obtained by both methods are compared in Figure 4. TTPXHunter outperforms TRAM, reflecting the difference between the contextual embeddings of a general scientific BERT and those of a domain-specific BERT: a domain-specific language model provides a better contextual understanding than a language model trained on general scientific terms.
TTPXHunter vs TTPHunter. We assess the performance of TTPXHunter alongside the state-of-the-art TTPHunter, and TTPXHunter performs better. Its advantage comes from using a fine-tuned, cyber-domain-specific language model. Sentences containing domain-specific terms, such as “Windows” and “registry,” carry a context that differs from general English, and the domain-specific language model captures and interprets this contextual meaning more accurately than general-purpose models. In addition, TTPXHunter identifies the full range of \(193\) TTPs with a \(0.92\) f1-score, whereas TTPHunter is limited to only \(50\) TTPs. The improved results and the ability to cover the full range of TTPs make TTPXHunter superior to TTPHunter.
Further, to understand the effectiveness of the data augmentation method and to compare TTPXHunter and TTPHunter on the same basis, we also evaluate TTPXHunter on TTPHunter's ground. Note that we did not apply TTPXHunter directly to the base dataset because a few TTPs have only one sample, which makes it challenging to split the data evenly into training and testing sets. We therefore compare TTPXHunter on TTPHunter's base dataset, which consists of 50 TTP classes: we select only the 50-TTP set for which TTPHunter was developed and evaluate both models on it. The obtained results are shown in Figure 5. TTPXHunter outperforms TTPHunter on TTPHunter's own 50-TTP ground, which indicates that augmenting more samples for the 50-TTP set and employing a domain-specific language model enable the classifier to better understand the context of each TTP.
5.2.2 Report-based Evaluation.
In the real world, threat intelligence arrives as natural-language reports rather than sentence-wise datasets, and these reports contain general information alongside TTP-related sentences. We therefore evaluate TTPXHunter on the report dataset, in which each sample consists of a threat report's sentences and the list of TTPs described in the report. Extracting TTPs from threat reports is a multi-label problem because the expected output for a given sample, i.e., a threat report, is a list of TTP classes. Evaluating such a classifier also requires careful consideration because of its multi-label nature.
Evaluation Metrics. In a multi-label problem, the prediction is a multi-hot vector rather than the one-hot vector of a multi-class problem. A model may correctly predict only a subset of the expected TTP classes, in which case the prediction is still counted as wrong because the whole multi-hot vector does not match. For example, if the true label set is \(\{T1,T2,T3\}\) and the predicted label set is \(\{T2,T3\}\), the prediction is treated as a mismatch even though \(T2\) and \(T3\) are correctly classified. Relying on accuracy is therefore not a good choice for multi-label problems [25]; instead, we use hamming loss, which measures the label-wise error rate [10, 25, 33], i.e., the ratio of incorrectly predicted labels to the total number of labels. For \(k\) threat reports and \(N\) TTP labels, the hamming loss is defined as
\[
\text{Hamming Loss} = \frac{1}{k \cdot N}\sum_{i=1}^{k}\sum_{j=1}^{N}\left(y_{i,j} \oplus \hat{y}_{i,j}\right),
\]
where \(y_{i}\) and \(\hat{y}_{i}\) are the multi-hot predicted label and true label for the \(i\)th instance, respectively, and \(\oplus\) denotes the element-wise exclusive OR operation. A low hamming loss indicates that the model makes minimal wrong predictions.
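The sketch below computes the hamming loss exactly as defined above over multi-hot vectors and reuses the \(\{T1,T2,T3\}\) versus \(\{T2,T3\}\) example; it is illustrative rather than our evaluation code:

```python
import numpy as np

def hamming_loss(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    """Label-wise error rate: the fraction of label positions where the
    multi-hot prediction and the multi-hot ground truth disagree."""
    return np.logical_xor(y_pred.astype(bool), y_true.astype(bool)).mean()

# Toy example over three labels (T1, T2, T3) for a single report:
# true labels {T1, T2, T3}, predicted labels {T2, T3}.
y_true = np.array([[1, 1, 1]])
y_pred = np.array([[0, 1, 1]])
print(hamming_loss(y_pred, y_true))  # one wrong label out of three -> 0.333...
```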
We also evaluate macro-averaged precision, recall, and f1-score by leveraging the multi-label confusion matrix utility from sklearn [19]. We compute the true positives, false positives, and false negatives for each class, derive per-class precision, recall, and f1-score, and then macro-average them across all classes. We prefer macro averaging because it avoids bias toward the majority classes and gives equal weight to all classes.
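A sketch of this computation using sklearn's multi-label confusion matrix is shown below; the multi-hot arrays are illustrative placeholders:

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

# Multi-hot matrices of shape (num_reports, num_ttp_classes); values are toy data.
y_true = np.array([[1, 1, 1], [0, 1, 0]])
y_pred = np.array([[0, 1, 1], [0, 1, 1]])

# One 2x2 matrix per TTP class, laid out as [[TN, FP], [FN, TP]].
mcm = multilabel_confusion_matrix(y_true, y_pred)
tp, fp, fn = mcm[:, 1, 1], mcm[:, 0, 1], mcm[:, 1, 0]

# Per-class precision/recall/f1, guarding against division by zero.
precision = np.divide(tp, tp + fp, out=np.zeros(len(tp)), where=(tp + fp) > 0)
recall = np.divide(tp, tp + fn, out=np.zeros(len(tp)), where=(tp + fn) > 0)
f1 = np.divide(2 * precision * recall, precision + recall,
               out=np.zeros(len(tp)), where=(precision + recall) > 0)

# Macro average: every TTP class weighted equally.
print(precision.mean(), recall.mean(), f1.mean())
```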
We compare TTPXHunter against four state-of-the-art methods, i.e., [4, 14, 15, 17], based on these metrics over the report dataset. This comparison aims to assess the effectiveness of TTPXHunter over the state of the art for TTP extraction from finished threat reports. Each of these methods outputs the list of TTPs extracted from a given threat report along with a model confidence score for each TTP, and only the TTPs passing the method's own threshold mechanism are retained as relevant. We evaluate the state-of-the-art methods with the threshold values given in their respective articles. For TTPXHunter, we adopt the threshold experimentally chosen by TTPHunter, i.e., \(0.644\). The results obtained from all implemented methods on our report dataset are shown in Table 4. TTPXHunter outperforms all implemented methods across all chosen metrics, i.e., it achieves the lowest hamming loss and the highest values on the remaining metrics. It attains the highest f1-score of \(97.09\%\), whereas LADDER [4], the best-performing state-of-the-art method, achieves the second-highest f1-score of \(93.90\%\). This performance gain over the state of the art demonstrates the efficiency of TTPXHunter, and we plan to make it open source for the benefit of the community.
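For intuition, the report-level selection with such a confidence threshold can be sketched as follows; `classify_sentence` is a hypothetical stand-in for a fine-tuned sentence classifier, and only the \(0.644\) threshold value is taken from the text:

```python
from typing import Callable, Iterable, Set, Tuple

THRESHOLD = 0.644  # confidence threshold adopted from TTPHunter

def extract_report_ttps(
    sentences: Iterable[str],
    classify_sentence: Callable[[str], Tuple[str, float]],
) -> Set[str]:
    """Aggregate sentence-level predictions into a report-level TTP set,
    keeping only predictions whose confidence passes the threshold."""
    ttps: Set[str] = set()
    for sentence in sentences:
        ttp_id, confidence = classify_sentence(sentence)
        if confidence >= THRESHOLD:
            ttps.add(ttp_id)
    return ttps
```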
As this experiment involves \(193\) target TTP classes, visualizing the class-wise performance of the employed models is challenging. We therefore assess class-wise efficiency differently: we count the number of TTP classes whose score on a chosen performance metric falls within each range interval of width \(0.1\), i.e., \(10\%\).
We compute these counts across all five methods and the three chosen performance metrics, i.e., precision, recall, and f1-score. The obtained results are presented in Figures 6–8. Most TTP classes analyzed by rcATT fall within the \(0-0.10\) range, which contributes to its overall lower performance. This is due to its reliance on TF-IDF to transform sentences into vectors. TF-IDF is a numerical statistic that reflects how important a word is to a document in a collection or corpus, balancing its frequency within a document against its commonness across all documents [26, 27, 29]. It cannot capture the context and semantic relationships between words and therefore fails to grasp the overall meaning of sentences [18, 28]. ATTACKG likewise scores within the \(0-0.10\) range for certain TTP classes, which adversely affects its overall performance. TRAM, however, has no TTP class scoring below the \(0.3-0.4\) range, indicating better performance than rcATT and ATTACKG. LADDER ensures a score of at least \(0.4-0.5\) for every TTP class, which positions it ahead of rcATT, ATTACKG, and TRAM. Our proposed model, TTPXHunter, assures a minimum score in the \(0.6-0.7\) range for every TTP class, with the majority scoring between \(0.9\) and \(1.0\), which underscores TTPXHunter's significant advantage over the other methods and demonstrates the effectiveness of domain-specific models for domain-specific downstream tasks.
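The range-interval analysis itself reduces to a histogram over per-class scores, as in the sketch below (the per-class f1 array is a random placeholder for the \(193\) actual values):

```python
import numpy as np

# Placeholder per-class f1-scores for the 193 TTP classes.
per_class_f1 = np.random.default_rng(0).uniform(0.0, 1.0, size=193)

# Ten intervals of width 0.1: [0.0, 0.1), [0.1, 0.2), ..., [0.9, 1.0].
bins = np.arange(0.0, 1.1, 0.1)
counts, _ = np.histogram(per_class_f1, bins=bins)

for low, high, count in zip(bins[:-1], bins[1:], counts):
    print(f"{low:.1f}-{high:.1f}: {count} TTP classes")
```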