Abstract
This paper describes a novel method for clustering datasets containing malware behavioural data. Our method transforms the data into a standardised data matrix that can be used in any clustering algorithm, finds the number of clusters in the data set, and includes an optional visualization step for high-dimensional data using principal component analysis. Our clustering method deals well with categorical data, and it is able to cluster the behavioural data of 17,000 websites, acquired with Capture-HPC, in less than 2 minutes.
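The data-transformation and clustering steps mentioned above can be illustrated with a minimal sketch. It assumes that each categorical feature value is coded as a 0/1 column and that standardisation means centring each column on its mean; the abstract does not specify the chapter's actual standardisation, so this is only one plausible reading. The clustering step is plain K-means with a fixed number of clusters, not the chapter's procedure for finding that number, and the visualization step is omitted. All function names and the sample records are hypothetical.

```python
def one_hot_standardise(records):
    """Code categorical records as a centred 0/1 matrix (assumed scheme).

    Each (feature, value) pair becomes one column; every column is then
    centred on its mean. Returns (matrix, column_labels).
    """
    columns = sorted({(f, v) for rec in records for f, v in rec.items()})
    n = len(records)
    matrix = [[1.0 if rec.get(f) == v else 0.0 for f, v in columns]
              for rec in records]
    means = [sum(row[j] for row in matrix) / n for j in range(len(columns))]
    centred = [[x - m for x, m in zip(row, means)] for row in matrix]
    return centred, columns


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(data, k, max_iter=100):
    """Plain K-means, seeded with the first k distinct rows (deterministic)."""
    centroids = []
    for row in data:
        if row not in centroids:
            centroids.append(list(row))
        if len(centroids) == k:
            break
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every row.
        new_labels = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(row, centroids[c]))
            for row in data
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [row for row, lab in zip(data, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids


# Hypothetical behavioural records, one dict per visited website.
records = [
    {"file": "write", "registry": "none"},
    {"file": "write", "registry": "none"},
    {"file": "none", "registry": "set"},
    {"file": "none", "registry": "set"},
]
matrix, columns = one_hot_standardise(records)
labels, centroids = k_means(matrix, k=2)
```

Because the resulting matrix is purely numerical, it could be passed to any other clustering algorithm in place of the `k_means` function used here.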
Notes
1. Our new version of Capture-HPC will be available soon at https://projects.honeynet.org/capture-hpc. AxMock can be downloaded at http://code.google.com/p/axmock/
2. 3,273 from http://www.malwaredomainlist.com/update.php plus 13,763 from http://www.malware.com.br/lists.shtml
Acknowledgments
The authors would like to thank Tiffany Youzhi Bao for her instrumental work developing AxMock, which ultimately allowed us to upgrade Capture-HPC to mock ActiveX components.
© 2014 Springer International Publishing Switzerland
About this chapter
Cite this chapter
de Amorim, R.C., Komisarczuk, P. (2014). Partitional Clustering of Malware Using K-Means. In: Blackwell, C., Zhu, H. (eds) Cyberpatterns. Springer, Cham. https://doi.org/10.1007/978-3-319-04447-7_18
DOI: https://doi.org/10.1007/978-3-319-04447-7_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-04446-0
Online ISBN: 978-3-319-04447-7
eBook Packages: Computer Science (R0)