Partitional Clustering of Malware Using K-Means

de Amorim, Renato Cordeiro; Komisarczuk, Peter

doi:10.1007/978-3-319-04447-7_18

Abstract

This paper describes a novel method for clustering datasets containing malware behavioural data. Our method transforms the data into a standardised data matrix that can be used with any clustering algorithm, finds the number of clusters in the data set, and includes an optional visualisation step for high-dimensional data using principal component analysis. Our clustering method deals well with categorical data, and it is able to cluster the behavioural data of 17,000 websites, acquired with Capture-HPC, in less than 2 minutes.
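
The first part of the pipeline outlined in the abstract (categorical behavioural records turned into a numeric data matrix, then partitioned with K-Means) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the chapter's actual procedure: the feature values such as `proc_created` are hypothetical, the encoding is plain one-hot coding with column centring, the number of clusters `k` is passed in by hand instead of being found automatically, the initial centroids are chosen with a simple farthest-first rule, and the optional principal component analysis step is left out.

```python
def one_hot(rows):
    """Encode categorical rows as a 0/1 matrix, one column per (feature, category)."""
    columns = sorted({(j, v) for row in rows for j, v in enumerate(row)})
    index = {col: i for i, col in enumerate(columns)}
    matrix = []
    for row in rows:
        vec = [0.0] * len(columns)
        for j, v in enumerate(row):
            vec[index[(j, v)]] = 1.0
        matrix.append(vec)
    return matrix, columns


def centre(matrix):
    """Subtract each column's mean so every column has mean zero."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(data, k, max_iter=100):
    """Plain K-Means with a deterministic farthest-first initialisation."""
    # Assumption: start from the first point, then repeatedly add the point
    # farthest from the centroids chosen so far.
    centroids = [list(data[0])]
    while len(centroids) < k:
        far = max(data, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in data]
        if new_labels == labels:
            break  # No assignment changed, so the partition is stable.
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical behavioural records: (process event, file event, registry event).
rows = [
    ("proc_created", "file_write", "reg_set"),
    ("proc_created", "file_write", "reg_set"),
    ("proc_created", "file_write", "reg_none"),
    ("proc_none", "file_none", "reg_none"),
    ("proc_none", "file_none", "reg_none"),
    ("proc_none", "file_read", "reg_none"),
]
matrix, columns = one_hot(rows)
labels, centroids = k_means(centre(matrix), k=2)
print(labels)
```

In this toy data set the first three records share a process-creation event and the last three do not, so a two-cluster partition separates them. Once the categorical records are encoded this way, the resulting matrix is ordinary numeric data and could be handed to any other clustering algorithm instead.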

Notes

1. Our new version of Capture-HPC will be available soon at https://projects.honeynet.org/capture-hpc. AxMock can be downloaded at http://code.google.com/p/axmock/
2. 3,273 from http://www.malwaredomainlist.com/update.php plus 13,763 from http://www.malware.com.br/lists.shtml (17,036 in total).

Acknowledgments

The authors would like to thank Tiffany Youzhi Bao for her instrumental work developing AxMock, which ultimately allowed us to upgrade Capture-HPC to mock ActiveX components.

Author information

Authors and Affiliations

Department of Computing, Glyndŵr University, Mold Road, Wrexham, LL11 2AW, UK
Renato Cordeiro de Amorim
School of Computing Technology, University of West London, St Mary’s Road, London, W5 5RF, UK
Peter Komisarczuk

Corresponding author

Correspondence to Renato Cordeiro de Amorim.

Editor information

Editors and Affiliations

Department of Computing and Communication Technologies, Oxford Brookes University, Oxford, Oxfordshire, United Kingdom
Clive Blackwell
Department of Computing and Communication Technologies, Oxford Brookes University, Oxford, Oxfordshire, United Kingdom
Hong Zhu

About this chapter

Cite this chapter

de Amorim, R.C., Komisarczuk, P. (2014). Partitional Clustering of Malware Using K-Means. In: Blackwell, C., Zhu, H. (eds) Cyberpatterns. Springer, Cham. https://doi.org/10.1007/978-3-319-04447-7_18

DOI: https://doi.org/10.1007/978-3-319-04447-7_18
Published: 14 May 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-04446-0
Online ISBN: 978-3-319-04447-7
eBook Packages: Computer Science (R0)
