Abstract
Previous research has shown that hidden Markov model (HMM) analysis is useful for detecting certain challenging classes of malware. In this research, we consider the related problem of malware classification based on HMMs. We train multiple HMMs on a variety of compilers and malware generators. More than 8,000 malware samples are then scored against these models and separated into clusters based on the resulting scores. We observe that the clustering results could be used to classify the malware samples into their appropriate families with good accuracy. Since none of the malware families in the test set were used to generate the HMMs, these results indicate that our approach can effective classify previously unknown malware, at least in some cases. Thus, such a clustering strategy could serve as a useful tool in malware analysis and classification.








Similar content being viewed by others
Notes
In a row stochastic matrix, each row defines a probability distribution. That is, each element is in the range of 0 to 1, and the elements of any row must sum to 1.
References
Annachhatre, C.: Hidden Markov models for malware classification. Department of Computer Science, San Jose State University, Master’s report (2013)
Attaluri, S., McGhee, S., Stamp, M.: Profile hidden Markov models and metamorphic virus detection. J. Comput. Virol. 5(2), 151–169 (2009)
Austin, T., Filiol, E., Josse, S., Stamp, M.: Exploring hidden Markov models for virus analysis: a semantic approach. In: 46th Hawaii International Conference on System Sciences (HICSS 46), pp. 5039–5048 (2013)
Baysa, D., Low, R.M., Stamp, M.: Structural entropy and metamorphic malware. J. Comput. Virol. Hacking Tech. 9(4), 179–192 (2013)
Bradley, A.P.: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 30, 1145–1159 (1997)
Canzanese, R., Kam, M., Mancoridis, S.: Toward an automatic, online behavioral malware classification system. https://www.cs.drexel.edu/~spiros/papers/saso2013.pdf (2013)
Cesare, S., Xiang, Y.: Classification of Malware using structured control flow. In: 8th Australasian Symposium on Parallel and Distributed Computing, vol. 107, pp. 61–70 (2010)
Do, C.B., Batzoglou, S.: What is the expectation maximization algorithm? Nat. Biotechnol. 26(8), 897–899. http://ai.stanford.edu/~chuongdo/papers/em_tutorial.pdf (2008)
Indika: Difference between hierarchical and partitional clustering. http://www.differencebetween.com/difference-between-hierarchical-and-vs-partitional-clustering (2011)
Jain, A., Dubes, R.: Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs (1988)
Jin, R.: Cluster validation. http://www.cs.kent.edu/~jin/DM08/ClusterValidation.pdf (2008)
Jones, K.: A statistical interpretation of term specificity and its application in retrieval. J. Doc. 28(1), 11–21 (1972)
Kolter, S., Maloof, M.: Learning to detect and classify malicious executables in the wild. J. Mach. Learn. Res. 7, 2721–2744 (2006)
Krogh, A.: An Introduction to Hidden Markov Models for Biological Sequences. Computational Methods in Molecular Biology. Elsevier, Lyngby (1998)
Krogh, A., et al.: Hidden Markov models in computational biology: applications to protein modeling. J. Mol. Biol. 235(5), 1501–1531 (1994)
Lakhotia, A., Walenstein, A., Miles, C., Singh, A.: VILO: a rapid learning nearest-neighbor classifier for malware triage. J. Comput. Virol. 9(3), 109–123 (2013)
MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297 (1967)
Maloof, M.A.: Machine Learning and Data Mining for Computer Security: Methods and Applications. Springer, Berlin (2006)
Ming, X., et al.: A similarity metric method of obfuscated malware using function-call graph. J. Comput. Virol. Hacking Tech. 9(1), 35–47 (2013)
MITRE: Malware attribute enumeration and characterization. http://maec.mitre.org (2013)
Moore, A.W.: \(K\)-Means and hierarchical clustering. http://www.autonlab.org/tutorials/kmeans11.pdf (2001)
Nappa, A., Zubair Rafique, M., Caballero, J.: Driving in the cloud: an analysis of drive-by download operations and abuse reporting of viruses. In: Proceedings of the 10th Conference on Detection of Intrusions and Malware & Vulnerability Assessment (2013)
Park, Y., Reeves, D.S., Stamp, M.: Deriving common malware behavior through graph clustering. Comput. Secur. 39(B), 419–430 (2013)
Rabiner, L.: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257–286 (1989)
Runwal, N., Low, R., Stamp, M.: Opcode graph similarity and metamorphic detection. J. Comput. Virol. 8, 37–52 (2012)
Saleh, M., Mohamed, A., Nabi, A.: Eigenviruses for metamorphic virus recognition. IET Inf. Secur. 5(4), 191–198 (2011)
Skulason, F., Solomon, A., Bontchev, V.: CARO naming scheme. http://www.caro.org/naming/scheme.html (1991)
Sorokin, I.: Comparing files using structural entropy. J. Comput. Virol. 7(4), 259–265 (2011)
Sridhara, S.M., Stamp, M.: Metamorphic worm that carries its own morphing engine. J. Comput. Virol. Hacking Tech. 9(2), 49–58 (2013)
Stamp, M.: A revealing introduction to hidden Markov models. http://www.cs.sjsu.edu/faculty/stamp/RUA/HMM.pdf (2012)
Swimmer, M.: Response to the proposal for a “C virus” database. ACM SIGSAC Review, vol. 8, pp. 1–5. http://www.odysci.com/article/1010112993890087 (1990)
Symantec: Trojan.Zbot. http://www.symantec.com/security_response/writeup.jsp?docid=2010-011016-3514-99 (2010)
Symantec Security Response: Trojan.Zeroaccess. http://www.symantec.com/security_response/writeup.jsp?docid=2011-071314-0410-99 (2011)
Virus Removal Services: Beware of FAKE antivirus—Winwebsec. http://virus.myfirstattempt.com/2012/11/beware-of-fake-anti-virus-winwebsec.html (2012)
VX Heavens. http://vx.netlux.org/ (2013)
Wong, W., Stamp, M.: Hunting for metamorphic engines. J. Comput. Virol. 2(3), 211–229 (2006)
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Annachhatre, C., Austin, T.H. & Stamp, M. Hidden Markov models for malware classification. J Comput Virol Hack Tech 11, 59–73 (2015). https://doi.org/10.1007/s11416-014-0215-x
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11416-014-0215-x