research-article

Extraction of Phrase-based Concepts in Vulnerability Descriptions through Unsupervised Labeling

Authors:

Linyi HanAuthors Info & Claims

ACM Transactions on Software Engineering and Methodology, Volume 32, Issue 5

Article No.: 112, Pages 1 - 45

https://doi.org/10.1145/3579638

Published: 22 July 2023 Publication History

Get Access

Abstract

Software vulnerabilities, once disclosed, can be documented in vulnerability databases, which have great potential to advance vulnerability analysis and security research. People describe the key characteristics of software vulnerabilities in natural language mixed with domain-specific names and concepts. This textual nature poses a significant challenge for the automatic analysis of vulnerability knowledge embedded in text. Automatic extraction of key vulnerability aspects is highly desirable but demands significant effort to manually label data for model training.

In this article, we propose unsupervised methods to label and extract important vulnerability concepts in textual vulnerability descriptions (TVDs). We focus on six types of phrase-based vulnerability concepts (vulnerability type, vulnerable component, root cause, attacker type, impact, and attack vector) as they are much more difficult to label and extract than name- or number-based entities (i.e., vendor, product, and version). Our approach is based on a key observation that the same-type of phrases, no matter how they differ in sentence structures and phrase expressions, usually share syntactically similar paths in the sentence parsing trees. Specifically, we present a source-target neural architecture that learns the Part-of-Speech (POS) tagging to identify a token’s functional role within TVDs, where the source neural model is trained to capture common features found in the TVD corpus, and the target model is trained to identify linguistically malformed words specific to the security domain. Our evaluation confirms that the proposed tagger outperforms (4.45%–5.98%) the taggers designed on natural language notions and identifies a broad set of TVDs and natural language contents. Then, based on the key observations, we propose two path representations (absolute paths and relative paths) and use an auto-encoder to encode such syntactic similarities. To address the discrete nature of our paths, we enhance the traditional Variational Auto-encoder (VAE) with Gumble-Max trick for categorical data distribution and thus create a Categorical VAE (CaVAE). In the latent space of absolute and relative paths, we further apply unsupervised clustering techniques to generate clusters of the same-type of concepts. Our evaluation confirms the effectiveness of our CaVAE, which achieves a small (85.85) log-likelihood for encoding path representations and the accuracy (83%–89%) of vulnerability concepts in the resulting clusters.

The resulting clusters accurately label six types of vulnerability concepts from a TVD corpus in an unsupervised way. Furthermore, these labeled vulnerability concepts can be mapped back to the corresponding phrases in the original TVDs, which produce labels of six types of vulnerability concepts. The resulting labeled TVDs can be used to train concept extraction models for other TVD corpora. In this work, we present two concept extraction methods (concept classification and sequence labeling model) to demonstrate the utility of the unsupervisedly labeled concepts. Our study shows that models trained with our unsupervisedly labeled vulnerability concepts outperform (3.9%–5.14%) those trained with the two manually labeled TVD datasets from previous work due to the consistent boundary and typing by our unsupervised labeling method.

References

[1]

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, and Jeffrey Dean. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation 265–283.

Abstract

References

Cited By

Recommendations

Detecting and Augmenting Missing Key Aspects in Vulnerability Descriptions

Aspect-level Information Discrepancies across Heterogeneous Vulnerability Reports: Severity, Types and Detection Methods

Towards Practical Binary Code Similarity Detection: Vulnerability Verification via Patch Semantic Analysis

Comments

Information

Published In

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Funding Sources

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

Login options

Full Access

View options

PDF

eReader

Full Text

Share

Share this Publication link

Share on social media

Affiliations