We evaluate TTPXHunter using several performance metrics and compare it against recent studies. Because our dataset is imbalanced, as shown in Appendix Figure A2, we report macro-averaged precision, recall, and f1-score [19], which ensure a balanced evaluation across all classes [12]. By giving equal weight to every TTP class, these metrics prevent the dominance of majority classes from overshadowing performance on minority classes, which promotes effective and fair models across TTP classes and is essential for nuanced TTP classification.
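As an illustrative sketch (not our exact evaluation code), macro-averaged metrics for sentence-level multi-class predictions can be computed with sklearn as follows; the `y_true` and `y_pred` lists are hypothetical:

```python
from sklearn.metrics import precision_recall_fscore_support

# Per-sentence TTP class labels (ATT&CK technique IDs); values are illustrative.
y_true = ["T1059", "T1003", "T1059", "T1566"]
y_pred = ["T1059", "T1059", "T1059", "T1566"]

# average="macro" gives every TTP class equal weight, so minority classes
# count as much as frequent ones in the reported scores.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"macro precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```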
5.2.1 Augmented Sentence-based Evaluation.
We fine-tune TTPXHunter on the prepared augmented dataset and compare it against the two BERT-based models present in the literature, i.e., TRAM [17] and TTPHunter [22]. We split the augmented dataset into train and test sets with an \(80:20\) ratio.
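As a minimal sketch of such a split (the file and column names are hypothetical, and stratifying on the TTP label is our assumption for keeping class proportions comparable across splits):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical columns: "sentence" (text) and "ttp" (ATT&CK technique ID).
df = pd.read_csv("augmented_dataset.csv")

# 80:20 train/test split; stratify keeps per-class proportions similar
# in both splits (requires at least two samples per TTP class).
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["ttp"]
)
print(len(train_df), len(test_df))
```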
TTPXHunter vs TRAM. We fine-tune TTPXHunter on the train set and evaluate it using the chosen performance metrics. We then fine-tune TRAM [17] on the same dataset and evaluate it in the same way. Employing TRAM on the augmented dataset extends its capability from the 50 most frequently used TTPs to the full spectrum of TTPs, providing common ground for comparing TTPXHunter and TRAM. The results obtained by both methods are compared in Figure 4. TTPXHunter outperforms TRAM, reflecting the difference between the contextual embeddings of a general scientific BERT and those of a domain-specific BERT: a domain-specific language model provides a better contextual understanding than a language model trained on general scientific terms.
TTPXHunter vs TTPHunter. We assess the performance of TTPXHunter alongside the state-of-the-art TTPHunter, and TTPXHunter performs better. Its advantage comes from using a fine-tuned, cyber-domain-specific language model. Sentences containing domain-specific terms, such as “Windows” and “registry,” carry a context that differs from general English, and the domain-specific language model captures and interprets this contextual meaning more accurately than general-purpose models. In addition, TTPXHunter identifies the full range of \(193\) TTPs with a \(0.92\) f1-score, whereas TTPHunter is limited to only \(50\) TTPs. The improved results and the ability to cover the full range of TTPs make TTPXHunter superior to TTPHunter.
Further, to understand the effectiveness of the data augmentation method and to compare TTPXHunter and TTPHunter on the same basis, we also evaluate TTPXHunter on TTPHunter's ground. Note that we did not apply TTPXHunter directly to the base dataset because a few TTPs have only one sample, which makes it challenging to split the data evenly into training and testing sets. We therefore compare TTPXHunter on TTPHunter's base dataset, which consists of 50 TTP classes: we select only the 50-TTP set for which TTPHunter was developed and evaluate both models on it. The obtained results are shown in Figure 5. TTPXHunter outperforms TTPHunter on TTPHunter's own 50-TTP ground, which indicates that augmenting more samples for the 50-TTP set and employing a domain-specific language model enable the classifier to better understand the context of each TTP.
5.2.2 Report-based Evaluation.
In the real world, threat intelligence arrives as natural-language reports rather than sentence-wise datasets, and these reports contain general information alongside TTP-related sentences. We therefore evaluate TTPXHunter on the report dataset, in which each sample consists of a threat report's sentences and the list of TTPs described in the report. Extracting TTPs from threat reports is a multi-label problem because the expected output for a given sample, i.e., a threat report, is a list of TTP classes. Evaluating such a classifier also requires careful consideration because of its multi-label nature.
Evaluation Metrics. In a multi-label problem, the prediction is a multi-hot vector rather than the one-hot vector of a multi-class problem. A model may correctly predict only a subset of the expected TTP classes, in which case the prediction is still counted as wrong because the whole multi-hot vector does not match. For example, if the true label set is \(\{T1,T2,T3\}\) and the predicted label set is \(\{T2,T3\}\), the prediction is treated as a mismatch even though \(T2\) and \(T3\) are correctly classified. Relying on accuracy is therefore not a good choice for multi-label problems [25]; instead, we use hamming loss, which measures the label-wise error rate [10, 25, 33], i.e., the ratio of incorrectly predicted labels to the total number of labels. For \(k\) threat reports and \(N\) TTP labels, the hamming loss is defined as
\[
\text{Hamming Loss} = \frac{1}{k \cdot N}\sum_{i=1}^{k}\sum_{j=1}^{N}\left(y_{i,j} \oplus \hat{y}_{i,j}\right),
\]
where \(y_{i}\) and \(\hat{y}_{i}\) are the multi-hot predicted label and true label for the \(i\)th instance, respectively, and \(\oplus\) denotes the element-wise exclusive OR operation. A low hamming loss indicates that the model makes minimal wrong predictions.
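The sketch below computes the hamming loss exactly as defined above over multi-hot vectors and reuses the \(\{T1,T2,T3\}\) versus \(\{T2,T3\}\) example; it is illustrative rather than our evaluation code:

```python
import numpy as np

def hamming_loss(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    """Label-wise error rate: the fraction of label positions where the
    multi-hot prediction and the multi-hot ground truth disagree."""
    return np.logical_xor(y_pred.astype(bool), y_true.astype(bool)).mean()

# Toy example over three labels (T1, T2, T3) for a single report:
# true labels {T1, T2, T3}, predicted labels {T2, T3}.
y_true = np.array([[1, 1, 1]])
y_pred = np.array([[0, 1, 1]])
print(hamming_loss(y_pred, y_true))  # one wrong label out of three -> 0.333...
```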
We also evaluate macro-averaged precision, recall, and f1-score by leveraging the multi-label confusion matrix utility from sklearn [19]. We compute the true positives, false positives, and false negatives for each class, derive per-class precision, recall, and f1-score, and then macro-average them across all classes. We prefer macro averaging because it avoids bias toward the majority classes and gives equal weight to all classes.
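A sketch of this computation using sklearn's multi-label confusion matrix is shown below; the multi-hot arrays are illustrative placeholders:

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

# Multi-hot matrices of shape (num_reports, num_ttp_classes); values are toy data.
y_true = np.array([[1, 1, 1], [0, 1, 0]])
y_pred = np.array([[0, 1, 1], [0, 1, 1]])

# One 2x2 matrix per TTP class, laid out as [[TN, FP], [FN, TP]].
mcm = multilabel_confusion_matrix(y_true, y_pred)
tp, fp, fn = mcm[:, 1, 1], mcm[:, 0, 1], mcm[:, 1, 0]

# Per-class precision/recall/f1, guarding against division by zero.
precision = np.divide(tp, tp + fp, out=np.zeros(len(tp)), where=(tp + fp) > 0)
recall = np.divide(tp, tp + fn, out=np.zeros(len(tp)), where=(tp + fn) > 0)
f1 = np.divide(2 * precision * recall, precision + recall,
               out=np.zeros(len(tp)), where=(precision + recall) > 0)

# Macro average: every TTP class weighted equally.
print(precision.mean(), recall.mean(), f1.mean())
```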
We compare TTPXHunter against four state-of-the-art methods, i.e., [4, 14, 15, 17], based on these metrics over the report dataset. This comparison aims to assess the effectiveness of TTPXHunter over the state of the art for TTP extraction from finished threat reports. Each of these methods outputs the list of TTPs extracted from a given threat report along with a model confidence score for each TTP, and only the TTPs passing the method's own threshold mechanism are retained as relevant. We evaluate the state-of-the-art methods with the threshold values given in their respective articles. For TTPXHunter, we adopt the threshold experimentally chosen by TTPHunter, i.e., \(0.644\). The results obtained from all implemented methods on our report dataset are shown in Table 4. TTPXHunter outperforms all implemented methods across all chosen metrics, i.e., it achieves the lowest hamming loss and the highest values on the remaining metrics. It attains the highest f1-score of \(97.09\%\), whereas LADDER [4], the best-performing state-of-the-art method, achieves the second-highest f1-score of \(93.90\%\). This performance gain over the state of the art demonstrates the efficiency of TTPXHunter, and we plan to make it open source for the benefit of the community.
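For intuition, the report-level selection with such a confidence threshold can be sketched as follows; `classify_sentence` is a hypothetical stand-in for a fine-tuned sentence classifier, and only the \(0.644\) threshold value is taken from the text:

```python
from typing import Callable, Iterable, Set, Tuple

THRESHOLD = 0.644  # confidence threshold adopted from TTPHunter

def extract_report_ttps(
    sentences: Iterable[str],
    classify_sentence: Callable[[str], Tuple[str, float]],
) -> Set[str]:
    """Aggregate sentence-level predictions into a report-level TTP set,
    keeping only predictions whose confidence passes the threshold."""
    ttps: Set[str] = set()
    for sentence in sentences:
        ttp_id, confidence = classify_sentence(sentence)
        if confidence >= THRESHOLD:
            ttps.add(ttp_id)
    return ttps
```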
As this experiment involves \(193\) target TTP classes, visualizing the class-wise performance of the employed models is challenging. We therefore assess class-wise efficiency differently: we count the number of TTP classes whose score on a chosen performance metric falls within each range interval of width \(0.1\), i.e., \(10\%\).
We compute these counts across all five methods and the three chosen performance metrics, i.e., precision, recall, and f1-score. The obtained results are presented in Figures 6–8. Most TTP classes analyzed by rcATT fall within the \(0-0.10\) range, which contributes to its overall lower performance. This is due to its reliance on TF-IDF to transform sentences into vectors. TF-IDF is a numerical statistic that reflects how important a word is to a document in a collection or corpus, balancing its frequency within a document against its commonness across all documents [26, 27, 29]. It cannot capture the context and semantic relationships between words and therefore fails to grasp the overall meaning of sentences [18, 28]. ATTACKG likewise scores within the \(0-0.10\) range for certain TTP classes, which adversely affects its overall performance. TRAM, however, has no TTP class scoring below the \(0.3-0.4\) range, indicating better performance than rcATT and ATTACKG. LADDER ensures a score of at least \(0.4-0.5\) for every TTP class, which positions it ahead of rcATT, ATTACKG, and TRAM. Our proposed model, TTPXHunter, assures a minimum score in the \(0.6-0.7\) range for every TTP class, with the majority scoring between \(0.9\) and \(1.0\), which underscores TTPXHunter's significant advantage over the other methods and demonstrates the effectiveness of domain-specific models for domain-specific downstream tasks.
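The range-interval analysis itself reduces to a histogram over per-class scores, as in the sketch below (the per-class f1 array is a random placeholder for the \(193\) actual values):

```python
import numpy as np

# Placeholder per-class f1-scores for the 193 TTP classes.
per_class_f1 = np.random.default_rng(0).uniform(0.0, 1.0, size=193)

# Ten intervals of width 0.1: [0.0, 0.1), [0.1, 0.2), ..., [0.9, 1.0].
bins = np.arange(0.0, 1.1, 0.1)
counts, _ = np.histogram(per_class_f1, bins=bins)

for low, high, count in zip(bins[:-1], bins[1:], counts):
    print(f"{low:.1f}-{high:.1f}: {count} TTP classes")
```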