research-article

Extraction of Phrase-based Concepts in Vulnerability Descriptions through Unsupervised Labeling

Published: 22 July 2023

Abstract

Software vulnerabilities, once disclosed, can be documented in vulnerability databases, which have great potential to advance vulnerability analysis and security research. People describe the key characteristics of software vulnerabilities in natural language mixed with domain-specific names and concepts. This textual nature poses a significant challenge for the automatic analysis of vulnerability knowledge embedded in text. Automatic extraction of key vulnerability aspects is highly desirable but demands significant effort to manually label data for model training.
In this article, we propose unsupervised methods to label and extract important vulnerability concepts in textual vulnerability descriptions (TVDs). We focus on six types of phrase-based vulnerability concepts (vulnerability type, vulnerable component, root cause, attacker type, impact, and attack vector) because they are much more difficult to label and extract than name- or number-based entities (i.e., vendor, product, and version). Our approach is based on the key observation that phrases of the same type, no matter how they differ in sentence structure and phrase expression, usually share syntactically similar paths in the sentence parse trees. Specifically, we first present a source-target neural architecture that learns Part-of-Speech (POS) tagging to identify a token’s functional role within TVDs: the source neural model is trained to capture common features found in the TVD corpus, and the target model is trained to identify linguistically malformed words specific to the security domain. Our evaluation confirms that the proposed tagger outperforms taggers designed on natural language notions by 4.45%–5.98% and identifies a broad set of TVDs and natural language content. Then, based on the key observation, we propose two path representations (absolute paths and relative paths) and use an auto-encoder to encode such syntactic similarities. To address the discrete nature of our paths, we enhance the traditional Variational Auto-encoder (VAE) with the Gumbel-Max trick for categorical data distributions and thus create a Categorical VAE (CaVAE). In the latent space of absolute and relative paths, we further apply unsupervised clustering techniques to generate clusters of concepts of the same type. Our evaluation confirms the effectiveness of our CaVAE, which achieves a small log-likelihood (85.85) for encoding path representations and an accuracy of 83%–89% for the vulnerability concepts in the resulting clusters.
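As an illustration of the Gumbel-Max trick mentioned above, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a categorical distribution given as unnormalized log-probabilities (logits). The function names, the temperature parameter, and the softmax relaxation (commonly used to make the sample differentiable during training) are assumptions introduced for illustration only.

```python
import math
import random


def sample_gumbel(rng):
    """Draw Gumbel(0, 1) noise by inverse transform: -log(-log(U))."""
    # rng.random() lies in [0, 1); clamp away from 0 so log(u) is defined.
    u = max(rng.random(), 1e-12)
    return -math.log(-math.log(u))


def gumbel_max_sample(logits, rng):
    """Return a category index drawn from softmax(logits).

    Adding independent Gumbel noise to each logit and taking the argmax
    is equivalent to sampling from the categorical distribution.
    """
    noisy = [logit + sample_gumbel(rng) for logit in logits]
    return max(range(len(noisy)), key=noisy.__getitem__)


def gumbel_softmax(logits, temperature, rng):
    """Relaxed sample: softmax of the noisy logits divided by a temperature.

    Lower temperatures push the output closer to a one-hot vector.
    """
    noisy = [(logit + sample_gumbel(rng)) / temperature for logit in logits]
    peak = max(noisy)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical distribution over three path symbols.
    logits = [math.log(p) for p in (0.7, 0.2, 0.1)]
    print(gumbel_max_sample(logits, rng))
    print(gumbel_softmax(logits, temperature=0.5, rng=rng))
```

The hard argmax yields an exact categorical sample, while the relaxed version returns a probability vector that approaches one-hot as the temperature decreases.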
The resulting clusters accurately label six types of vulnerability concepts from a TVD corpus in an unsupervised way. Furthermore, these labeled vulnerability concepts can be mapped back to the corresponding phrases in the original TVDs, which produces TVDs labeled with the six types of vulnerability concepts. The resulting labeled TVDs can be used to train concept extraction models for other TVD corpora. In this work, we present two concept extraction methods (a concept classification model and a sequence labeling model) to demonstrate the utility of the unsupervisedly labeled concepts. Our study shows that models trained with our unsupervisedly labeled vulnerability concepts outperform those trained with the two manually labeled TVD datasets from previous work by 3.9%–5.14%, owing to the consistent boundaries and typing produced by our unsupervised labeling method.
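To make the step of mapping labeled concepts back to phrases concrete, the sketch below turns phrase-level labels into token-level tags that a sequence labeling model could train on. It is a minimal sketch under stated assumptions: the BIO tagging scheme, the label names, the span format (start index, exclusive end index, label), and the example sentence are all hypothetical and are not taken from the article.

```python
def spans_to_bio(tokens, spans):
    """Convert labeled token spans into BIO tags.

    tokens: list of token strings for one description.
    spans:  list of (start, end_exclusive, label) tuples over token indices.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if not (0 <= start < end <= len(tokens)):
            raise ValueError(f"span out of range: {(start, end, label)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end, label)}")
        tags[start] = "B-" + label          # first token of the phrase
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the phrase
    return tags


if __name__ == "__main__":
    # Hypothetical description and concept labels, for illustration only.
    tokens = ("Buffer overflow in the login function allows remote "
              "attackers to execute arbitrary code").split()
    spans = [
        (0, 2, "VULN_TYPE"),
        (4, 6, "COMPONENT"),
        (7, 9, "ATTACKER"),
        (10, 13, "IMPACT"),
    ]
    for token, tag in zip(tokens, spans_to_bio(tokens, spans)):
        print(f"{token}\t{tag}")
```

Rejecting overlapping spans keeps phrase boundaries unambiguous, which matters if boundary consistency is what the downstream model is meant to learn.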


Cited By

  • (2024) Vision: Identifying Affected Library Versions for Open Source Software Vulnerabilities. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 1447–1459. DOI: 10.1145/3691620.3695516. Online publication date: 27-Oct-2024
  • (2024) Comprehensive vulnerability aspect extraction. Applied Intelligence 54:3, 2881–2899. DOI: 10.1007/s10489-023-05262-4. Online publication date: 8-Feb-2024
  • (2024) Relation Extraction Techniques in Cyber Threat Intelligence. Natural Language Processing and Information Systems, 348–363. DOI: 10.1007/978-3-031-70239-6_24. Online publication date: 25-Jun-2024


      Published In

      ACM Transactions on Software Engineering and Methodology, Volume 32, Issue 5
      September 2023
      905 pages
      ISSN:1049-331X
      EISSN:1557-7392
      DOI:10.1145/3610417
      • Editor: Mauro Pezzè

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 22 July 2023
      Online AM: 09 February 2023
      Accepted: 26 November 2022
      Revised: 08 November 2022
      Received: 28 January 2022
      Published in TOSEM Volume 32, Issue 5

      Author Tags

      1. Textual vulnerability descriptions
      2. phrase-based vulnerability concepts
      3. unsupervised representation learning
      4. clustering and concept labeling
      5. supervised concept extraction

      Qualifiers

      • Research-article

      Funding Sources

      • National Natural Science Foundation of China (NSFC)

