Partitional Clustering of Malware Using K-Means

  • Chapter
  • First Online:
Cyberpatterns

Abstract

This paper describes a novel method for clustering datasets containing malware behavioural data. Our method transforms the data into a standardised data matrix that can be used in any clustering algorithm, finds the number of clusters in the dataset, and includes an optional visualization step for high-dimensional data using principal component analysis. Our clustering method deals well with categorical data, and it is able to cluster the behavioural data of 17,000 websites, acquired with Capture-HPC, in less than 2 min.
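The first steps of such a pipeline can be illustrated with a minimal sketch. It is not the authors' implementation: the one-hot encoding, the mean-and-range standardisation, the farthest-point initialisation and every function name below are assumptions chosen for illustration, and the number of clusters `k` is passed in by hand rather than estimated from the data. The PCA visualization step is omitted.

```python
def one_hot(records):
    """Encode rows of categorical values as a 0/1 matrix,
    with one column per (feature, category) pair."""
    columns = []
    for j in range(len(records[0])):
        for category in sorted({row[j] for row in records}):
            columns.append((j, category))
    return [[1.0 if row[j] == c else 0.0 for j, c in columns]
            for row in records]


def standardise(matrix):
    """Centre each column on its mean and divide by its range.
    Constant columns have range 0, so they are divided by 1 instead."""
    n = len(matrix)
    cols = list(zip(*matrix))
    means = [sum(col) / n for col in cols]
    ranges = [(max(col) - min(col)) or 1.0 for col in cols]
    return [[(v - m) / r for v, m, r in zip(row, means, ranges)]
            for row in matrix]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100):
    """Plain K-means with a deterministic farthest-point initialisation
    (an assumption made here so the example is reproducible)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Add the point whose nearest existing centroid is farthest away.
        farthest = max(points,
                       key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(farthest))

    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids


# Hypothetical behavioural records: (file operation, file type, registry change).
records = [
    ("write", "exe", "yes"),
    ("write", "exe", "yes"),
    ("write", "exe", "no"),
    ("read", "dll", "no"),
    ("read", "dll", "no"),
    ("read", "dll", "yes"),
]

matrix = standardise(one_hot(records))
labels, centroids = k_means(matrix, k=2)
print(labels)
```

Once the categorical records are encoded and standardised, the resulting numeric matrix can be handed to any distance-based clustering algorithm, which is the property the abstract emphasises.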

Notes

  1. Our new version of Capture-HPC will be available soon at https://projects.honeynet.org/capture-hpc. AxMock can be downloaded at http://code.google.com/p/axmock/

  2. 3,273 from http://www.malwaredomainlist.com/update.php plus 13,763 from http://www.malware.com.br/lists.shtml

Acknowledgments

The authors would like to thank Tiffany Youzhi Bao for her instrumental work developing AxMock, which ultimately allowed us to upgrade Capture-HPC to mock ActiveX components.

Author information

Corresponding author

Correspondence to Renato Cordeiro de Amorim.

Copyright information

© 2014 Springer International Publishing Switzerland

About this chapter

Cite this chapter

de Amorim, R.C., Komisarczuk, P. (2014). Partitional Clustering of Malware Using K-Means. In: Blackwell, C., Zhu, H. (eds) Cyberpatterns. Springer, Cham. https://doi.org/10.1007/978-3-319-04447-7_18

  • DOI: https://doi.org/10.1007/978-3-319-04447-7_18

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-04446-0

  • Online ISBN: 978-3-319-04447-7

  • eBook Packages: Computer Science, Computer Science (R0)
