A Survey on Malware Detection Using Data Mining Techniques

Authors:

Donald Adjeroh,

S. Sitharama Iyengar

ACM Computing Surveys (CSUR), Volume 50, Issue 3

Article No.: 41, Pages 1 - 40

https://doi.org/10.1145/3073559

Published: 29 June 2017

Abstract

In the Internet age, malware (such as viruses, trojans, ransomware, and bots) poses serious and evolving security threats to Internet users. To protect legitimate users from these threats, anti-malware software products from companies including Comodo, Kaspersky, Kingsoft, and Symantec provide the major defense against malware. Unfortunately, driven by economic benefits, the number of new malware samples has increased explosively: anti-malware vendors are now confronted with millions of potential malware samples per year. To keep up with this growth, there is an urgent need for intelligent methods that detect malware effectively and efficiently in the large collections of real samples gathered daily. In this article, we first provide a brief overview of malware and the anti-malware industry and present the industry's needs for malware detection. We then survey intelligent malware detection methods. In these methods, the detection process is usually divided into two stages: feature extraction and classification/clustering. The performance of such approaches depends critically on the extracted features and on the classification/clustering methods. We provide a comprehensive investigation of both the feature extraction and the classification/clustering techniques. We also discuss additional issues and challenges of malware detection using data mining techniques and, finally, forecast trends in malware development.
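The two-stage process named in the abstract (feature extraction, then classification) can be illustrated with a minimal sketch. Everything in it is an assumption chosen for illustration and is not taken from the article: byte n-gram counts stand in for the extracted features, a 1-nearest-neighbour rule with cosine similarity stands in for the classifier, and the function names and toy byte strings are hypothetical.

```python
from collections import Counter
import math


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Stage 1 (feature extraction): count overlapping byte n-grams."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(sample: bytes, labeled: list[tuple[bytes, str]], n: int = 2) -> str:
    """Stage 2 (classification): label of the most similar training sample."""
    feats = byte_ngrams(sample, n)
    best = max(labeled, key=lambda item: cosine(feats, byte_ngrams(item[0], n)))
    return best[1]


# Toy, made-up training data (hypothetical, for illustration only).
TRAINING = [
    (b"\x90\x90\x90\x90\xeb\xfe\x90\x90", "malicious"),
    (b"hello world, hello again", "benign"),
]

print(classify(b"\x90\x90\xeb\xfe\x90", TRAINING))
print(classify(b"hello there", TRAINING))
```

A real system would draw features from far larger labeled collections and would likely use a stronger classifier or a clustering method; the sketch only shows how the two stages fit together.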

References

[1]

Tony Abou-As saleh, Nick Cercone, Vlado Keselj, and Ray Sweidan. 2004. N-gram-based detection of new malicious code. In Proceedings of the 28th Annual International Computer Software and Applications Conference (COMPSAC).

Digital Library

[2]

David W. Aha, Dennis Kibler, and Marc K. Albert. 1991. Instance-based learning algorithms. Machine Learning 6, 1 (1991), 37--66.

[3]

Blake Anderson, Daniel Quist, Joshua Neil, Curtis Storlie, and Terran Lane. 2011. Graph based malware detection using dynamic analysis. Journal in Computer Virology 4 (2011), 247--258.

Digital Library

[4]

Blake Anderson, Curtis Storlie, and Terran Lane. 2012. Improving malware classification: Bridging the static/dynamic gap. In Proceedings of 5th ACM Workshop on Security and Artificial Intelligence (AISec).

Digital Library

[5]

Anubis. 2010. Anubis: Analyzing Unknown Binaries. Retrieved from http://anubis.iseclab.org/.

[6]

Michael Bailey, Jon Oberheide, Jon Andersen, Z. Morley Mao, Farnam Jahanian, and Jose Nazario. 2007. Automated classification and analysis of internet malware. In Proceedings of the 10th International Conference on Recent Advances in Intrusion Detection.

Digital Library

[7]

Ulrich Bayer, Paolo Milani Comparetti, Clemens Hlauschek, Christopher Kruegel, and Engin Kirda. 2009. Scalable, behavior-based malware clustering. In Proceedings of the 16th Annual Network and Distributed System Security Symposium.

[8]

Ulrich Bayer, Christopher Kruegel, and Engin Kirda. 2006a. TTAnalyze: A tool for analyzing malware. In EICAR.

[9]

Ulrich Bayer, Andreas Moser, Christopher Kruegel, and Engin Kirda. 2006b. Dynamic analysis of malicious code. Journal in Computer Virology 2(1) (2006), 67--77.

[10]

Zahra Bazrafshan, Hashem Hashemi, Seyed Mehdi Hazrati Fard, and Ali Hamzeh. 2013. A survey on heuristic malware detection techniques. In Proceedings of the 5th Conference on Information and Knowledge Technology (IKT).

[11]

Philippe Beaucamps and ric Filiol. 2007. On the possibility of practically obfuscating programs towards a unified perspective of code protection. Journal in Computer Virology 3, 1 (2007), 3--21.

[12]

Yoshua Bengio. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2, 1 (2009), 1--127.

Digital Library

[13]

Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. 2007. Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems 19 (2007).

[14]

Christopher M. Bishop. 1995. Neural networks for pattern recognition. Oxford, Clarendon Press.

Digital Library

[15]

Bizjournals. 2011. McAfee: Trends in a decade of cybercrime. Retrieved from http://www.bizjournals.com/sanjose/news/2011/01/25/mcafee-trends-in-a-decade-of-cybercrime.html?page=all.

[16]

Kevin Borders and Atul Prakash. 2004. Web tap: Detecting covert web traffic. In Proceedings of the 11th ACM Conference on Computer and Communications Security.

Digital Library

[17]

Leo Breiman. 1996. Bagging predicators. Machine Learning 24, 2 (1996), 123--140.

Digital Library

[18]

Leo Breiman. 2001. Random forests. Machine Learning 45, 1 (2001), 5--32.

Digital Library

[19]

Juan Caballero, Heng Yin, Zhenkai Liang, and Dawn Song. 2007. Polyglot: Automatic extraction of protocol message format using dynamic binary analysis. In Proceedings of the 14th ACM Conference on Computer and Communications Security (CCS).

Digital Library

[20]

Duen Horng Chau, Carey Nachenberg, Jeffrey Wilhelm, Adam Wright, and Christos Faloutsos. 2011. Polonium: Tera-scale graph mining for malware detection. In Proceedings of the SIAM International Conference on Data Mining (SDM).

[21]

Lingwei Chen, William Hardy, Yanfang Ye, and Tao Li. 2015. Analyzing file-to-file relation network in malware detection. In Proceedings of the International Conference on Web Information Systems Engineering (WISE).

Digital Library

[22]

Mihai Christodorescu and Somesh Jha. 2003. Static analysis of executables to detect malicious patterns. In Proceedings of the 12th Conference on USENIX Security Symposium.

Digital Library

[23]

Mihai Christodorescu, Somesh Jha, and Christopher Kruegel. 2007. Mining specifications of malicious behavior. In Proceedings of ESEC/FSE.

Digital Library

[24]

Mihai Christodorescu, Somesh Jha, Sanjit A. Seshia, Dawn Song, and Randal E. Bryant. 2005. Semantics-aware malware detection. In Proceedings of IEEE Symposium on Security and Privacy.

Digital Library

[25]

William W. Cohen. 1995. Fast effective rule induction. In Proceedings of 12th International Conference on Machine Learning.

Digital Library

[26]

Peter Coogan. 2010. SpyEye Bot Versus Zeus Bot. Retrieved from http://www.symantec.com/connect/blogs/spyeye-bot-versus-zeus-bot.

[27]

Thomas Cover and Peter Hart. 1967. Nearest nieghbor pattern classification. IEEE Transaction on Information Theory IT-13, 1 (1967), 21--27.

Digital Library

[28]

Jedidiah R. Crandall, Zhendong Su, S. Felix Wu, and Frederic T. Chong. 2005. On deriving unknown vulnerabilities from zero-day polymorphic and metamorphic worm exploits. In Proceedings of the 12th ACM Conference on Computer and Communications Security (CCS).

Digital Library

[29]

Nilesh Dalvi, Pedro Domingos, Mausam, Sumit Sanghai, and Deepak Verma. 2004. Adversarial classification. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 99--108.

Digital Library

[30]

Damballa. 2008. 3% to 5% of Enterprise Assets Are Compromised by Bot-Driven Targeted Attack Malware. Retrieved from http://www.prnewswire.com/news-releases/3-to-5-of-enterprise-assets-are-compromised-by-bot-driven-targeted-attack-malware-61634867.html.

[31]

Mohsen Damshenas, Ali Dehghantanha, and Ramlan Mahmoud. 2013. A survey on malware propagation, analysis, and detection. International Journal of Cyber-Security and Digital Forensics (IJCSDF) 2, 4 (2013), 10--29.

[32]

Sanjeev Das, Yang Liu, Wei Zhang, and Mahintham Chandramohan. 2016. Semantics-based online malware detection: Towards efficient real-time protection against malware. IEEE Transactions on Information Forensics and Security 11, 2 (2016), 289--302.

Digital Library

[33]

Thomas Dietterich. 1997. Machine learning research: Four current directions. Artificial Intelligence Magzine 18, 4 (1997), 97--36.

[34]

Thomas G. Dietterich. 2000. Ensemble methods in machine learning. In Proceedings of the 1st International Workshop on Multiple Classifier Systems.

Digital Library

[35]

Artem Dinaburg, Paul Royal, Monirul Sharif, and Wenke Lee. 2008. Ether: Malware analysis via hardware virtualization extensions. In Proceedings of the 15th ACM Conference on Computer and Communications Security (CCS).

Digital Library

[36]

Pedro Domingos and Michael Pazzani. 1997. On the optimality of simple Bayesian classifier under zero-one loss. Machine Learning 29, 2--3 (1997), 103--130.

Digital Library

[37]

Manuel Egele, Theodoor Scholte, Engin Kirda, and Christopher Kruegel. 2012. A survey on automated dynamic malware-analysis techniques and tools. ACM Computing Surveys (CSUR) 44, 2 (2012), 6.

Digital Library

[38]

Yuval Elovici, Asaf Shabtai, Robert Moskovitch, Gil Tahan, and Chanan Glezer. 2007. Applying machine learning techniques for detection of malicious code in network traffic. KI: Advances in Artificial Intelligence (2007).

Digital Library

[39]

EMarketer. 2014. Global B2C Ecommerce Sales to Hit &dollar;1.5 Trillion This Year Driven by Growth in Emerging Markets. Retrieved from http://www.emarketer.com/Article/Global-B2C-Ecommerce-Sales-Hit-15-Trillion-This-Year-Driven-by-Growth-Emerging-Markets/1010575.

[40]

Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Gomes Amorim. 2014. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research 15, 1 (2014), 3133--3181.

Digital Library

[41]

Eric Filiol, Gregoire Jacob, and Mickael Le Liard. 2007. Evaluation methodology and theoretical model for antiviral behavioural detection strategies. Journal in Computer Virology 3, 1 (2007), 23--37.

[42]

Ivan Firdausi, Alva Erwin, and Anto Satriyo Nugroho. 2010. Analysis of machine learning techniques used in behavior based malware detection. In Proceedings of 2nd International Conference on Advances in Computing, Control and Telecommunication Technologies (ACT).

Digital Library

[43]

Evelyn Fix and Joseph L. Hodges Jr. 1951. Discriminatory analysis-nonparametric discrimination: Consistency properties. US Air Force, School of Avaiation Medicine, Tech. Rep 4 (1951), 5--32.

[44]

Matt Fredrikson, Somesh Jha, Mihai Christodorescu, Reiner Sailer, and Xifeng Yan. 2010. Synthesizing near-optimal malware specifications from suspicious behaviors. In Proceedings of IEEE Symposium on Security and Privacy.

Digital Library

[45]

Yoav Freund and Robert E. Schapire. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comp. Syst. Sci. 55, 1 (1997), 119--39.

Digital Library

[46]

Ekta Gandotra, Divya Bansal, and Sanjeev Sofat. 2014. Malware analysis and classification: A survey. Journal of Information Security 5, 2 (2014), 56--64.

[47]

Maria Garnaeva, Victor Chebyshev, Denis Makrushin, Roman Unuchek, and Anton Ivanov. 2014. Kaspersky Security Bulletin 2014. Retrieved from http://securelist.com/analysis/kaspersky-security-bulletin/68010/kaspersky-security-bulletin-2014-overall-statistics-for-2014/.

[48]

Todd R. Golub, Donna K. Slonim, Pablo Tamayo, Christine Huard, Michelle Gaasenbeek, Jill P. Mesirov, and Hilary Coller. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 5439 (1999), 531--537.

[49]

Isabelle Guyon and Andr Elisseeff. 2003. An introduction to variable and feature selection. Jouranl of Machine Learning Research 3 (March 2003), 1157--1182.

Digital Library

[50]

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter (2009).

Digital Library

[51]

William Hardy, Lingwei Chen, Shifu Hou, Yanfang Ye, and Xin Li. 2016. DL4MD: A deep learning framework for intelligent malware detection. In Proceedings of the International Conference on Data Mining (DMIN).

[52]

Olivier Henchiri and Nathalie Japkowicz. 2006a. A feature selection and evaluation scheme for computer virus detection. In Proceedings of the 6th International Conference on Data Mining.

Digital Library

[53]

Olivier Henchiri and Nathalie Japkowicz. 2006b. A feature selection and evaluation scheme for computer virus detection. In Proceedings of ICDM.

Digital Library

[54]

Shif Hou, Aaron Saas, Yanfang Ye, and Lifei Chen. 2016. DroidDelver: An android malware detection system using deep belief network based on API call blocks. In Proceedings of the International Conference on Web-Age Information Management. 54--66.

[55]

Xin Hu. 2011. Large-scale malware analysis, detection, and signature generation. Ph.D. Dissertation, Department of Computer Science and Engineering, University of Michigan.

Digital Library

[56]

Galen Hunt and Doug Brubacher. 1998. Detours: Binary interception of win32 functions. In Proceedings of the 3rd USENIX Windows NT Symposium.

Digital Library

[57]

IDAPro. 2016. The Interactive Disassembler. Retrieved from https://www.hex-rays.com/products/ida/support/download_freeware.shtml.

[58]

Nwokedi Idika and Aditya P. Mathur. 2007. A survey of malware detection techniques. Research Report in Purdue University (2007).

[59]

Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbor: Towards removing the curse of dimensionality. In Proceedings of 30th Annual ACM Symposium on Theory of Computing.

Digital Library

[60]

Virtualization Technology Intel. 2013. Retrieved from http://www.intel.com/technology/virtualization.

[61]

Rafiqul Islam, Ronghua Tian, Lynn M. Batten, and Steve Versteeg. 2013. Classification of malware based on integrated static and dynamic features. Journal of Network and Computer Application 36, 2 (2013), 646--656.

Digital Library

[62]

ITU. 2014. ITU releases 2014 ICT figures. Retrieved from https://www.itu.int/net/pressoffice/press_releases/2014/23.aspx.

[63]

Anil K. Jain, Robert P. W. Duin, and Jianchang Mao. 2000. Statistical pattern recognition: A review. IEEE Trans. Pattern Anal. Mach. Intell. 22, 1 (2000), 4--37.

Digital Library

[64]

Xuxian Jiang, Dongyan Xu, Helen Wang, and Eugene Spafford. 2005. Virtual playgrounds for worm behavior investigation. In Proceedings of the 8th International Symposium on Recent Advances in Intrusion Detection.

Digital Library

[65]

Thorsten Joachims. 1998. Making large-scale support vector machine learning practical. Advances in Kernel Methods: Support Vector Machines (1998).

Digital Library

[66]

George H. John and Pat Langley. 1995. Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.

Digital Library

[67]

Min Gyung Kang, Pongsin Poosankam, and Heng Yin. 2007. Renovo: A hidden code extractor for packed executables. In Proceedings of the 5th ACM Workshop on Recurring Malcode (WORM).

Digital Library

[68]

Chris Kanich, Christian Kreibich, Kirill Levchenko, Brandon Enright, Vern Paxson, Geoffrey M. Voelker, and Stefan Savage. 2008. Spamalytics: An empirical analysis of spam marketing conversion. In Proceedings of the 15th ACM Conference on Computer and Communications Security (CCS).

Digital Library

[69]

Nikos Karampatziakis, Jack W. Stokes, Anil Thomas, and Mady Marinescu. 2013. Using file relationships in malware classification. In Proceedings of the Conference on Detection of Intrusions and Malware and Vulnerability Assessment.

Digital Library

[70]

Md Enamul Karim, Andrew Walenstein, Arun Lakhotia, and Laxmi Parida. 2005. Malware phylogeny generation using permutations of code. Journal in Computer Virology 1, 1--2 (2005), 13--23.

[71]

Kaspersky. 2015. The Great Bank Robbery. Retrieved from http://www.kaspersky.com/about/news/virus/2015/Carbanak-cybergang-steals-1-bn-USDfrom-100-financial-institutions-worldwide.

[72]

Kris Kendall and Chad McMillan. 2007. Practical Malware Analysis. Retrieved from https://www.blackhat.com/presentations/bh-dc-07/Kendall_McMillan/Presentation/bh-dc-07-Kendall_McMillan.pdf.

[73]

Kingsoft. 2014. 2013-2014 Internet Security Report in China. Retrieved from http://www.ijinshan.com/news/2014011401.shtml.

[74]

Kingsoft. 2015. 2014-2015 Internet Security Research Report in China. Retrieved from http://www.cssn.cn/xwcbx/xwcbx_gcsy/201501/P020150122566733317860.pdf.

[75]

Kingsoft. 2016. 2015-2016 Internet Security Research Report in China. Retrieved from http://cn.cmcm.com/news/media/2016-01-14/60.html.

[76]

Clemens Kolbitsch, Paolo Milani Comparetti, Christopher Kruegel, Engin Kirda, Xiaoyong Zhou, and XiaoFengWang. 2009. Effective and efficient malware detection at the end host. In Proceedings of the 18th Conference on USENIX Security Symposium.

Digital Library

[77]

Jeremy Z. Kolter and Marcus A. Maloof. 2004. Learning to detect malicious executables in the wild. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (SIGKDD).

Digital Library

[78]

J. Zico Kolter and Marcus A. Maloof. 2006. Learning to detect and classify malicious executables in the wild. Journal of Machine Learning Research 7 (Dec. 2006), 2721--2744.

Digital Library

[79]

Nojun Kwak and Chong-Ho Choi. 2002. Input feature selection by mutual information based on parzen window. IEEE Trans. Pattern Anal. Mach. Intell. 24, 12 (2002), 1667--1671.

Digital Library

[80]

Pat Langley. 1994. Selection of relevant features in machine learning. In Proceedings of AAAI Fall Symposium.

[81]

Andrea Lanzi, Monirul Sharif, and Wenke Lee. 2009. K-Tracer: A system for extracting kernel malware behavior. In Proceedings of the 16th Annual Network and Distributed System Security Symposium (NDSS).

[82]

Tony Lee and Jigar J. Mody. 2006. Behavioral classification. In Proceedings of the European Institute for Computer Antivirus Research Conference (EICAR).

[83]

David D. Lewis and William A. Gale. 1994. A sequential algorithm for training text classifiers. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Springer-Verlag, New York, Inc., 3--12.

Digital Library

[84]

Shengqiao Li, E. James Harner, and Donald A. Adjeroh. 2011. Random KNN feature selection - A fast and stable alternative to random forests. BMC Bioinformatics 12, 1 (2011), 450.

[85]

Shengqiao Li, E. James Harner, and Donald A. Adjeroh. 2014. Random KNN. In Proceedings of the 2014 IEEE International Conference on Data Mining Workshops. 629--636.

[86]

Tao Li (Ed.). 2015. Event Mining: Algorithms and Applications. CRC Press.

Digital Library

[87]

LordPE. 2013. PE Tools - LordPE. Retrieved from http://www.malware-analyzer.com/pe-tools.

[88]

Mike Loukides and Andy Oram. 1996. Getting to know gdb. Linux Journal (1996).

Digital Library

[89]

James Lyne. 2014. Security threat trends 2015. Retrieved from https://www.sophos.com/threat-center/medialibrary/PDFs/other/sophos-trends-and-predictions-2015.pdf.

[90]

Mohammad M. Masud, Tahseen Al-Khateeb, Kevin W. Hamlen, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani M. Thuraisingham. 2011. Cloud-based malware detection for evolving data streams. ACM Trans. Management Inf. Syst. 2, 3 (2011), 16.

Digital Library

[91]

Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham. 2008. Mining concept-drifting data stream to detect peer to peer botnet traffic. Tech. rep. UTDCS-05-08, The University of Texas at Dallas, Richardson (2008).

[92]

Mohammad M. Masud, Latifur Khan, and Bhavani Thuraisingham. 2007. A scalable multi-level feature extraction technique to detect malicious executables. Information Systems Frontiers 10, 1 (2007), 33--45.

Digital Library

[93]

Kirti Mathur and Saroj Hiranwal. 2013. A survey on techniques in detection and analyzing malware executables. International Journal of Advanced Research in Computer Science and Software Engineering 3, 4 (2013), 422--428.

[94]

Micropoint. 2008. Micropoint Antivirus. Retrieved from http://www.micropoint.com.cn/Channel//20080626114608.html.

[95]

David Moore and Colleen Shannon. 2002. Code-red: A case study on the spread and victims of an internet worm. In Proceedings of the Internet Measurement Workshop.

Digital Library

[96]

Andreas Moser, Christopher Kruegel, and Engin Kirda. 2007. Limits of static analysis for malware detection. In Proceedings of the 23rd Annual Computer Security Applications Conference (ACSAC).

[97]

Robert Moskovitch, Clint Feher, and Yuval Elovici. 2009. A chronological evaluation of unknown malcode detection. LNCS: Intelligence and Security Informatics 5477 (2009), 112--117.

Digital Library

[98]

Robert Moskovitch, Clint Feher, Nir Tzachar, Eugene Berger, Marina Gitelman, Shlomi Dolev, and Yuval Elovici. 2008a. Unknown malcode detection using OPCODE representation. In Proceedings of the European Conference on Intelligence and Security Informatics (EuroISI).

Digital Library

[99]

Robert Moskovitch, Nir Nissim, and Yuval Elovici. 2008b. Acquisition of malicious code using active learning. In PinKDD.

[100]

Robert Moskovitch, Dima Stopel, Clint Feher, Nir Nissim, and Yuval Elovici. 2008c. Unknown malcode detection via text categorization and the imbalance problem. In IEEE Intelligence and Security Informatics.

Digital Library

[101]

Kevin P. Murphy. 2012. Machine learning: A probabilistic perspective. In The MIT Press, Cambridge, Massachusetts.

Digital Library

[102]

Ion Muslea, Steven Minton, and Craig A. Knoblock. 2006. Active learning with multiple views. Journal of Artificial Intelligence Research 27 (2006), 203--233.

[103]

Carey Nachenberg and Vijay Seshadri. 2010. An analysis of real-world effectiveness of reputation-based security. In Proceedings of the Virus Bulletin Conference (VB).

[104]

Nicholas Nethercote and Julian Seward. 2007. Valgrind: A framework for heavyweight dynamic binary instrumentation. In Proceedings of ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation.

Digital Library

[105]

Hieu T. Nguyen and Arnold Smeulders. 2004. Active learning using pre-clustering. In Proceedings of the 21st International Conference on Machine Learning. ACM, 79.

Digital Library

[106]

Ming Ni, Tao Li, Qianmu Li, Hong Zhang, and Yanfang Ye. 2016. FindMal: A file-to-file social network based malware detection framework. Knowledge-Based Systems 112 (2016), 142--151.

Digital Library

[107]

Corporation of Compuware. 1999. Debugging blue screens. Technical Paper (September 1999).

[108]

Gunter Ollmann. 2010. Serial variant evasion tactics techniques used to automatically bypass antivirus technologies. Retrieved from http://www.damballa.com/downloads/rpubs/WPSerialVariantEvasionTactics.pdf.

[109]

OllyDump. 2006. PE Tools - OllyDump. Retrieved from http://www.openrce.org/downloads/details/108/OllyDump.

[110]

David Orenstein. 2000. Application programming interface (API). In Quick Study: Application Programming Interface (API).

[111]

Nikunj C. Oza and Stuart Russell. 2001. Experimental comparisons of online and batch versions of bagging and boosting. In Proceedings of SIGKDD.

Digital Library

[112]

Judea Pearl. 1987. Evidential reasoning using stochastic simulation of causal models. Artificial Intelligence 32, 2 (1987), 245--258.

Digital Library

[113]

Hanchuan Peng, Fuhui Long, and Chris Ding. 2005. Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 8 (2005), 1226--1238.

Digital Library

[114]

Simon Perkins, Kevin Lacker, and James Theiler. 2003. Grafting: Fast incremental feature selection by gradient descent in function space. JMLR 3 (March 2003), 1333--1356.

Digital Library

[115]

Qemu. 2016. (2016). http://www.qemu-project.org/index.html.

[116]

Internet Security Center Qihoo. 2015. 2014 Internet Security Research Report in China. Retrieved from http://zt.360.cn/report/.

[117]

J. Ross Quinlan. 1986. Induction of decision trees. Machine Learning 1, 1 (1986), 81--106.

[118]

J. Ross Quinlan. 1993. C4.5: Programs for Machine Learning. San Francisco, CA: Morgan Kaufmann Publishers, Inc. (1993).

Digital Library

[119]

Alain Rakotomamonjy. 2003. Variable selection using SVM-based criteria. JMLR 3 (March 2003), 1357--1370.

Digital Library

[120]

Zulfikar Ramzan, Vijay Seshadri, and Carey Nachenberg. 2013. Reputation-based security: An analysis of real world effectiveness. In Symantec Security Response.

[121]

Rizwan Rehmani, G. C. Hazarika, and Gunadeep Chetia. 2011. Malware threats and mitigation strategies: A Survey. Journal of Theoretical and Applied Information Technology 29, 2 (2011), 69--73.

[122]

John Robbins. 1999. Debugging windows based applications using windbg. Microsoft Systems Journal (1999).

[123]

Lior Rokach. 2010. Ensemble-based classifiers. Artif Intell Rev 33, 1 (2010), 1--39.

Digital Library

[124]

Paul Royal, Mitch Halpin, David Dagon, Robert Edmonds, and Wenke Lee. 2006. PolyUnpack: Automating the hidden-code extraction of unpack-executing malware. In Proceedings of the 22nd Annual Computer Security Applications Conference.

Digital Library

[125]

Yvan Saeys, Inaki Inza, and Pedro Larranaga. 2007. A review of feature selection techniques in bioinformatics. Bioinformatics 23, 19 (2007), 2507--2517.

Digital Library

[126]

Gerard Salton, Anita Wong, and Chung-Shu Yang. 1975. A vector space model for automatic indexing. Commun. ACM 18, 11 (1975), 613--620.

Digital Library

[127]

Igor Santos, Jaime Devesa, Felix Brezo, Javier Nieves, and Pablo Garcia Bringas. 2013. OPEM: A static-dynamic approach for machine learning based malware detection. In Proceedings of International Conference CISIS-ICEUTE, Special Sessions Advances in Intelligent Systems and Computing.

[128]

Igor Santos, Carlos Laorden, and Pablo G. Bringas. 2011a. Collective classification for unknown malware detection. In Proceedings of the International Conference on Security and Cryptography.

[129]

Igor Santos, Javier Nieves, and Pablo G. Bringas. 2011b. Semi-supervised learning for unknown malware detection. In International Symposium on Distributed Computing and Artificial Intelligence Advances in Intelligent and Soft Computing.

[130]

Joshua Saxe and Konstantin Berlin. 2015. Deep neural network based malware detection using two dimensional binary program features. In Proceedings of the 10th International Conference on Malicious and Unwanted Software (MALWARE).

Digital Library

[131]

Matthew G. Schultz, Eleazar Eskin, F. Zadok, and Salvatore J. Stolfo. 2001. Data mining methods for detection of new malicious executables. In Proc. of the IEEE Symposium on Security and Privacy.

Digital Library

[132]

Fabrizio Sebastiani. 2002. Text categorization. Comput. Surveys 34, 1 (2002), 1--47.

Digital Library

[133]

H. Sebastian Seung, Manfred Opper, and Haim Sompolinsky. 1992. Query by committee. In Proceedings of the 5th Annual Workshop on Computational Learning Theory. ACM, 287--294.

Digital Library

[134]

Asaf Shabtai, Robert Moskovitch, Yuval Elovici, and Chanan Glezer. 2009. Detection of malicious code by applying machine learning classifiers on static features: A state-of-the-art survey. Information Security Technical Report 14, 1 (2009), 16--29.

[135]

Muazzam Siddiqui, Morgan C. Wang, and Joohan Lee. 2008. A survey of data mining techniques for malware detection using file features. In Proceedings of ACM-SE.

Digital Library

[136]

Muazzam Siddiqui, Morgan C. Wang, and Joohan Lee. 2009. Detecting internet worms using data mining techniques. Journal of Systemics, Cybernetics and Informatics 6, 6 (2009), 48--53.

[137]

Dawn Song, David Brumley, Heng Yin, Juan Caballero, Ivan Jager, Min Gyung Kang, Zhenkai Liang, James Newsome, Pongsin Poosankam, and Prateek Saxena. 2008. BitBlaze: A new approach to computer security via binary analysis. In Proceedings of the 4th International Conference on Information Systems Security.

Digital Library

[138]

Eugene H. Spafford. 1989. The internet worm incident. In Proceedings of the 2nd European Software Engineering Conference.

Digital Library

[139]

Elizabeth Stinson and John C. Mitchell. 2007. Characterizing bots’ remote control behavior. LNCS: Detection of Intrusions and Malware, and Vulnerability Assessment 4579 (2007), 89--108.

Digital Library

[140]

Jack W. Stokes, John C. Platt, Helen J. Wang, Joe Faulhaber, Jonathan Keller, Mady Marinescu, Anil Thomas, and Marius Gheorghescu. 2012. Scalable telemetry classification for automated malware detection. Computer Security - ESORICS (2012).

[141]

Andrew H. Sung, Jianyun Xu, Patrick Chavez, and Srinivas Mukkamala. 2004. Static analyzer of vicious executables (SAVE). In Proceedings of the 20th Annual Computer Security Applications Conference.

Digital Library

[142]

Symantec. 2008. Symantec global internet security threat report. Retrieved from http://eval.symantec.com/mktginfo/enterprise/white_papers/b-whitepaper_internet_security_threat_report_xiii_04-2008.en-us.pdf.

[143]

Symantec. 2014a. Internet Security Threat Report 2014. Retrieved from http://www.symantec.com/security_response/publications/threatreport.jsp.

[144]

Symantec. 2014b. The Threat Landscape in 2014 and Beyond: Symantec and Norton Predictions for 2015, Asia Pacific and Japan. Retrieved from http://www.symantec.com/connect/blogs/threat-landscape-2014-and-beyond-symantec-and-norton-predictions-2015-asia-pacific-japan.

[145]

Symantec. 2016. Internet Security Threat Report. Retrieved from https://www.symantec.com/content/dam/symantec/docs/reports/istr-21-2016-en.pdf.

[146]

Kimberly Tam, Ali Feizollah, Nor Badrul Anuar, Rosli Salleh, and Lorenzo Cavallaro. 2017. The evolution of android malware and android analysis techniques. ACM Computing Surveys (CSUR) 49, 4 (2017), 76.

Digital Library

[147]

Acar Tamersoy, Kevin Roundy, and Duen Horng Chau. 2014. Guilt by association: Large scale malware detection by mining file-relation graphs. In Proccedings of ACM International Conference on Knowledge Discovery and Data Mining (ACM SIGKDD).

Digital Library

[148]

Fadi Abdeljaber Thabtah. 2007. A review of associative classification mining. Knowledge Engineering Review 22, 1 (2007), 37--65.

Digital Library

[149]

Ronghua Tian, Rafiqul Islam, Lynn Batten, and Steve Versteeg. 2010. Differentiating malware from cleanwares using behavioral analysis. In Proceedings of 5th International Conference on Malicious and Unwanted Software (Malware).

[150]

TrendLabs. 2014. The invisible becomes visible: Trend micro security predictions for 2015 and beyond. (2014). http://www.trendmicro.com/cloud-content/us/pdfs/security-intelligence/reports/rpt-the-invisible-becomes-visible.pdf.

[151]

Trend Threat Research Team TrendMicro. 2010. Zeus: A Persistent Criminal Enterprise. Retrieved from http://www.trendmicro.com/cloud-content/us/pdfs/security-intelligence/white-papers/wp_zeuspersistent-criminal-enterprise.pdf.

[152]

Amit Vasudevan and Ramesh Yerraballi. 2005. Stealth breakpoints. In Proceedings of the 21st Annual Computer Security Applications Conference.

Digital Library

[153]

Amit Vasudevan and Ramesh Yerraballi. 2006. Cobra: Fine-grained malware analysis using stealth localized-executions. In Proceedings of 2006 IEEE Symposium on Security and Privacy.

Digital Library

[154]

Shobha Venkataraman, Avrim Blum, and Dawn Song. 2008. Limits of learning-based signature generation with adversaries. In NDSS.

[155]

Andrei Venzhega, Polina Zhinalieva, and Nikolay Suboch. 2013. Graph-based malware distributors detection. In Proceedings of the 22nd International Conference on World Wide Web Companion (WWW).

Digital Library

[156]

Randall Wald, Taghi M. Khoshgoftaar, and Amri Napolitano. 2013. Comparison of stability for different families of filter-based and wrapper-based feature selection. In ICMLA.

[157]

Tzu-Yen Wang, Shi-Jinn Horng, Ming-Yang Su, Chin-Hsiung Wu, Peng-Chu Wang, and Wei-Zen Su. 2006b. A surveillance spyware detection system based on data mining methods. Evolutionary Computation (2006), 3236--3241.

[158]

Yi-Min Wang, Doug Beck, Xuxian Jiang, and Roussi Roussev. 2006a. Automated web patrol with strider honeymonkeys: Finding web sites that exploit browser vulnerabilities. In NDSS.

[159]

Websense Security Labs. 2014. 2015 Security Predictions. Retrieved from http://www.websense.com/assets/reports/report-2015-security-predictions-en.pdf.

[160]

Paul Werbos. 1974. Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. Dissertation, Harvard University.

[161]

Wikipedia. 2016. Scareware. Retrieved from https://en.wikipedia.org/wiki/Scareware.

[162]

Wikipedia. 2017a. Assembly Language. Retrieved from http://en.wikipedia.org/wiki/Assembly_language.

[163]

Wikipedia. 2017b. Computer Virus. Retrieved from http://en.wikipedia.org/wiki/Computer_virus.

[164]

Wikipedia. 2017c. Morris Worm. Retrieved from http://en.wikipedia.org/wiki/Morris_worm.

[165]

Wikipedia. 2017d. Ransomware. Retrieved from https://en.wikipedia.org/wiki/Ransomware.

[166]

Wikipedia. 2017e. Rootkit. Retrieved from http://en.wikipedia.org/wiki/Rootkit.

[167]

Wikipedia. 2017f. Zero-day (computing). Retrieved from https://en.wikipedia.org/wiki/Zero-day_(computing).

[168]

Wikipedia. 2017g. Zeus (malware). Retrieved from http://en.wikipedia.org/wiki/Zeus_(malware).

[169]

Carsten Willems, Thorsten Holz, and Felix Freiling. 2007. Toward automated dynamic malware analysis using CWSandbox. In IEEE Security and Privacy.

[170]

Rui Xu and Donald Wunsch. 2005. Survey of clustering algorithms. IEEE Transactions on Neural Networks 16, 3 (2005), 645--678.

[171]

Yanfang Ye. 2010. Research on intelligent malware detection methods and their applications. Ph.D. Dissertation, Department of Computer Science, Xiamen University (2010).

[172]

Yanfang Ye, Lifei Chen, Dingding Wang, Tao Li, Qingshan Jiang, and Min Zhao. 2009. SBMDS: An interpretable string based malware detection system using SVM ensemble with bagging. Journal in Computer Virology 5, 4 (2009), 283--293.

[173]

Yanfang Ye, Tao Li, Yong Chen, and Qingshan Jiang. 2010. Automatic malware categorization using cluster ensemble. In Proceedings of ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD).

[174]

Yanfang Ye, Tao Li, Kai Huang, Qingshan Jiang, and Yong Chen. 2009a. Hierarchical associative classifier (HAC) for malware detection from the large and imbalanced gray list. Journal of Intelligent Information Systems 35, 1 (2009), 1--20.

[175]

Yanfang Ye, Tao Li, Qingshan Jiang, Zhixue Han, and Li Wan. 2009c. Intelligent file scoring system for malware detection from the gray list. In Proceedings of ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD).

[176]

Yanfang Ye, Tao Li, Qingshan Jiang, and Youyu Wang. 2009b. CIMDS: Adapting post-processing techniques of associative classification for malware detection system. IEEE Transactions on Systems, Man, and Cybernetics 40, 3 (2009), 298--307.

[177]

Yanfang Ye, Tao Li, Shenghuo Zhu, Weiwei Zhuang, Egemen Tas, Umesh Gupta, and Melih Abdulhayoglu. 2011. Combining file content and file relations for cloud based malware detection. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD).

[178]

Yanfang Ye, Dingding Wang, Tao Li, and Dongyi Ye. 2007. IMDS: Intelligent malware detection system. In Proceedings of ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD).

[179]

Yanfang Ye, Dingding Wang, Tao Li, Dongyi Ye, and Qingshan Jiang. 2008. An intelligent PE-malware detection system based on association mining. Journal in Computer Virology 4, 4 (2008), 323--334.

[180]

Jonathan S. Yedidia, William T. Freeman, and Yair Weiss. 2001. Understanding belief propagation and its generalizations. In Mitsubishi Electric Research Laboratories.

[181]

Heng Yin, Dawn Song, Manuel Egele, Christopher Kruegel, and Engin Kirda. 2007. Panorama: Capturing system-wide information flow for malware detection and analysis. In Proceedings of the 14th ACM Conference on Computer and Communications Security (CCS).

[182]

Chunqiu Zeng, Liang Tang, Wubai Zhou, Tao Li, Larisa Shwartz, and Genady Ya. Grabarnik. 2017. An integrated framework for mining temporal logs from fluctuating events. IEEE Transactions on Services Computing (TSC) (2017). In Press.

[183]

Boyun Zhang, Jianping Yin, Jingbo Hao, Dingxing Zhang, and Shulin Wang. 2007. Malicious codes detection based on ensemble learning. Autonomic and Trusted Computing (2007).

[184]

Jianwei Zhuge, Thorsten Holz, Chengyu Song, Jinpeng Guo, Xinhui Han, and Wei Zou. 2008. Studying malicious websites and the underground economy on the Chinese web. In Proceedings of the 7th Workshop on Economics of Information Security.

Index Terms

A Survey on Malware Detection Using Data Mining Techniques
1. Computing methodologies
  1. Machine learning
2. Security and privacy
  1. Systems security
    1. Operating systems security

Reviews

Reviewer: Klerisson Paixao

It is not new that software is eating the world [1]. Industries and businesses everywhere are being "softwareized." Meanwhile, we cannot deny that malware (malicious software) is also having a feast.

This paper provides a comprehensive survey of existing technology for malware detection, focused on data mining techniques. It starts with a taxonomy, primarily based on common types of malware: viruses, worms, Trojans, spyware, ransomware, scareware, bots, rootkits, and hybrid malware. Then, the paper describes the current state of the (anti-)malware industry.

The study is a bit short on the data mining techniques used: the authors restrict their efforts to describing detection methods that rely on classification and clustering algorithms. On the other hand, it does a very good job of summarizing dozens of methods used in the literature.

Further, the authors suggest new ideas for future research directions. Notably, they discuss the application of active learning to the task. Such a technique seems more appropriate to deal with a critical problem in the field: data scarcity. While cybercriminals usually cooperate and collaborate to build their malware, their counterparts keep collections of cybercrime data under lock.

The paper ends with a clear conclusion: there is no silver bullet when it comes to malware detection. All classification/clustering techniques have their pros and cons; thus, they will not always perform optimally. This survey serves well as a starting point and initial set of guidelines for people willing to do research in this field.

Online Computing Reviews Service

Information & Contributors

Information

Published In

ACM Computing Surveys Volume 50, Issue 3

May 2018

550 pages

ISSN:0360-0300

EISSN:1557-7341

DOI:10.1145/3101309

Editor:
Sartaj Sahni
Department of Computer and Information Science and Engineering / University of Florida / Gainesville, FL 32611

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 June 2017

Accepted: 01 March 2017

Revised: 01 November 2016

Received: 01 August 2015

Published in CSUR Volume 50, Issue 3

Qualifiers

Survey
Research
Refereed

Funding Sources

Scientific and Technological Support Project (Society) of Jiangsu
U.S. National Science Foundation
Chinese NSF

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 402
Total Downloads: 10,897
Downloads (last 12 months): 2,469
Downloads (last 6 weeks): 217

Reflects downloads up to 14 Aug 2024

Citations

Cited By

Sedhuramalingam K, Saravana Kumar D (2024). A Hybrid Rider Optimization with Deep Learning Driven Intrusion Detection Farmwork in Wireless Sensor Network. Salud, Ciencia y Tecnología - Serie de Conferencias 3 (762). Online publication date: 13-May-2024
https://doi.org/10.56294/sctconf2024762
Omar M (2024). Revolutionizing Malware Detection. Innovations, Securities, and Case Studies Across Healthcare, Business, and Technology (196-220). Online publication date: 12-Apr-2024
https://doi.org/10.4018/979-8-3693-1906-2.ch011
Wu W, Peng H, Zhu H, Zhang D (2024). CSMC: A Secure and Efficient Visualized Malware Classification Method Inspired by Compressed Sensing. Sensors 24:13 (4253). Online publication date: 30-Jun-2024
https://doi.org/10.3390/s24134253
Sun E, Han J, Li Y, Huang C (2024). A Packet Content-Oriented Remote Code Execution Attack Payload Detection Model. Future Internet 16:7 (235). Online publication date: 2-Jul-2024
https://doi.org/10.3390/fi16070235
Guo W, Xue J, Meng W, Han W, Liu Z, Wang Y, Li Z (2024). MalOSDF: An Opcode Slice-Based Malware Detection Framework Using Active and Ensemble Learning. Electronics 13:2 (359). Online publication date: 15-Jan-2024
https://doi.org/10.3390/electronics13020359
Wajahat A, He J, Zhu N, Mahmood T, Nazir A, Ullah F, Qureshi S, Osman M (2024). An effective deep learning scheme for android malware detection leveraging performance metrics and computational resources. Intelligent Decision Technologies 18:1 (33-55). Online publication date: 1-Jan-2024
https://dl.acm.org/doi/10.3233/IDT-230284
Baniya B, Rush T (2024). Intelligent Anomaly Detection System Based on Ensemble and Deep Learning. 2024 26th International Conference on Advanced Communications Technology (ICACT) (137-142). Online publication date: 4-Feb-2024
https://doi.org/10.23919/ICACT60172.2024.10471923
Cletus A, Opoku A, Weyori B (2024). An Evaluation of Current Malware Trends and Defense Techniques: A Scoping Review with Empirical Case Studies. Journal of Advances in Information Technology (649-671). Online publication date: 2024
https://doi.org/10.12720/jait.15.5.649-671
He S, Fu C, Hu H, Chen J, Lv J, Jiang S, Roychoudhury A, Paiva A, Abreu R, Storey M (2024). MalwareTotal: Multi-Faceted and Sequence-Aware Bypass Tactics against Static Malware Detection. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (1-12). Online publication date: 20-May-2024
https://dl.acm.org/doi/10.1145/3597503.3639141
Cui L, Yin J, Cui J, Ji Y, Liu P, Hao Z, Yun X (2024). API2Vec++: Boosting API Sequence Representation for Malware Detection and Classification. IEEE Transactions on Software Engineering 50:8 (2142-2162). Online publication date: Aug-2024
https://doi.org/10.1109/TSE.2024.3422990
