
A Survey on Malware Detection Using Data Mining Techniques

Published: 29 June 2017
    Abstract

    In the Internet age, malware (such as viruses, trojans, ransomware, and bots) has posed serious and evolving security threats to Internet users. To protect legitimate users from these threats, anti-malware software products from different companies, including Comodo, Kaspersky, Kingsoft, and Symantec, provide the major defense against malware. Unfortunately, driven by economic benefits, the number of new malware samples has increased explosively: anti-malware vendors are now confronted with millions of potential malware samples per year. To keep pace with this growth, there is an urgent need to develop intelligent methods for effective and efficient malware detection from the large, real-world daily sample collection. In this article, we first provide a brief overview of malware and the anti-malware industry, and present the industrial needs for malware detection. We then survey intelligent malware detection methods. In these methods, the detection process is usually divided into two stages: feature extraction and classification/clustering. The performance of such intelligent malware detection approaches depends critically on the extracted features and on the methods used for classification/clustering. We provide a comprehensive investigation of both the feature extraction and the classification/clustering techniques. We also discuss additional issues and challenges of malware detection using data mining techniques, and finally forecast the trends of malware development.
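
    The two-stage process described in the abstract (feature extraction, then classification/clustering) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the article: byte n-gram counts stand in for the feature extraction stage, a 1-nearest-neighbour rule with cosine similarity stands in for the classification stage, and the function names and the tiny TRAINING set are hypothetical toy values.

```python
from collections import Counter
from math import sqrt


def extract_ngrams(data: bytes, n: int = 2) -> Counter:
    """Stage 1 (feature extraction): count overlapping byte n-grams."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def classify(sample: bytes, training: list[tuple[bytes, str]], n: int = 2) -> str:
    """Stage 2 (classification): label of the most similar training sample."""
    feats = extract_ngrams(sample, n)
    best = max(training, key=lambda item: cosine(feats, extract_ngrams(item[0], n)))
    return best[1]


# Hypothetical, made-up byte strings standing in for labelled files.
TRAINING = [
    (b"\x90\x90\x90\xeb\xfe\x90\x90\xeb\xfe", "malicious"),
    (b"hello world, hello reader", "benign"),
]

if __name__ == "__main__":
    print(classify(b"\x90\x90\xeb\xfe\x90", TRAINING))
    print(classify(b"hello there", TRAINING))
```

    Keeping the two stages as separate functions mirrors the split the abstract describes: either stage could be swapped out (for example, different features or a clustering step) without touching the other.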

    References

    [1]
    Tony Abou-As saleh, Nick Cercone, Vlado Keselj, and Ray Sweidan. 2004. N-gram-based detection of new malicious code. In Proceedings of the 28th Annual International Computer Software and Applications Conference (COMPSAC).
    [2]
    David W. Aha, Dennis Kibler, and Marc K. Albert. 1991. Instance-based learning algorithms. Machine Learning 6, 1 (1991), 37--66.
    [3]
    Blake Anderson, Daniel Quist, Joshua Neil, Curtis Storlie, and Terran Lane. 2011. Graph based malware detection using dynamic analysis. Journal in Computer Virology 4 (2011), 247--258.
    [4]
    Blake Anderson, Curtis Storlie, and Terran Lane. 2012. Improving malware classification: Bridging the static/dynamic gap. In Proceedings of 5th ACM Workshop on Security and Artificial Intelligence (AISec).
    [5]
    Anubis. 2010. Anubis: Analyzing Unknown Binaries. Retrieved from http://anubis.iseclab.org/.
    [6]
    Michael Bailey, Jon Oberheide, Jon Andersen, Z. Morley Mao, Farnam Jahanian, and Jose Nazario. 2007. Automated classification and analysis of internet malware. In Proceedings of the 10th International Conference on Recent Advances in Intrusion Detection.
    [7]
    Ulrich Bayer, Paolo Milani Comparetti, Clemens Hlauschek, Christopher Kruegel, and Engin Kirda. 2009. Scalable, behavior-based malware clustering. In Proceedings of the 16th Annual Network and Distributed System Security Symposium.
    [8]
    Ulrich Bayer, Christopher Kruegel, and Engin Kirda. 2006a. TTAnalyze: A tool for analyzing malware. In EICAR.
    [9]
    Ulrich Bayer, Andreas Moser, Christopher Kruegel, and Engin Kirda. 2006b. Dynamic analysis of malicious code. Journal in Computer Virology 2(1) (2006), 67--77.
    [10]
    Zahra Bazrafshan, Hashem Hashemi, Seyed Mehdi Hazrati Fard, and Ali Hamzeh. 2013. A survey on heuristic malware detection techniques. In Proceedings of the 5th Conference on Information and Knowledge Technology (IKT).
    [11]
    Philippe Beaucamps and ric Filiol. 2007. On the possibility of practically obfuscating programs towards a unified perspective of code protection. Journal in Computer Virology 3, 1 (2007), 3--21.
    [12]
    Yoshua Bengio. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2, 1 (2009), 1--127.
    [13]
    Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. 2007. Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems 19 (2007).
    [14]
    Christopher M. Bishop. 1995. Neural networks for pattern recognition. Oxford, Clarendon Press.
    [15]
    Bizjournals. 2011. McAfee: Trends in a decade of cybercrime. Retrieved from http://www.bizjournals.com/sanjose/news/2011/01/25/mcafee-trends-in-a-decade-of-cybercrime.html?page=all.
    [16]
    Kevin Borders and Atul Prakash. 2004. Web tap: Detecting covert web traffic. In Proceedings of the 11th ACM Conference on Computer and Communications Security.
    [17]
    Leo Breiman. 1996. Bagging predicators. Machine Learning 24, 2 (1996), 123--140.
    [18]
    Leo Breiman. 2001. Random forests. Machine Learning 45, 1 (2001), 5--32.
    [19]
    Juan Caballero, Heng Yin, Zhenkai Liang, and Dawn Song. 2007. Polyglot: Automatic extraction of protocol message format using dynamic binary analysis. In Proceedings of the 14th ACM Conference on Computer and Communications Security (CCS).
    [20]
    Duen Horng Chau, Carey Nachenberg, Jeffrey Wilhelm, Adam Wright, and Christos Faloutsos. 2011. Polonium: Tera-scale graph mining for malware detection. In Proceedings of the SIAM International Conference on Data Mining (SDM).
    [21]
    Lingwei Chen, William Hardy, Yanfang Ye, and Tao Li. 2015. Analyzing file-to-file relation network in malware detection. In Proceedings of the International Conference on Web Information Systems Engineering (WISE).
    [22]
    Mihai Christodorescu and Somesh Jha. 2003. Static analysis of executables to detect malicious patterns. In Proceedings of the 12th Conference on USENIX Security Symposium.
    [23]
    Mihai Christodorescu, Somesh Jha, and Christopher Kruegel. 2007. Mining specifications of malicious behavior. In Proceedings of ESEC/FSE.
    [24]
    Mihai Christodorescu, Somesh Jha, Sanjit A. Seshia, Dawn Song, and Randal E. Bryant. 2005. Semantics-aware malware detection. In Proceedings of IEEE Symposium on Security and Privacy.
    [25]
    William W. Cohen. 1995. Fast effective rule induction. In Proceedings of 12th International Conference on Machine Learning.
    [26]
    Peter Coogan. 2010. SpyEye Bot Versus Zeus Bot. Retrieved from http://www.symantec.com/connect/blogs/spyeye-bot-versus-zeus-bot.
    [27]
    Thomas Cover and Peter Hart. 1967. Nearest nieghbor pattern classification. IEEE Transaction on Information Theory IT-13, 1 (1967), 21--27.
    [28]
    Jedidiah R. Crandall, Zhendong Su, S. Felix Wu, and Frederic T. Chong. 2005. On deriving unknown vulnerabilities from zero-day polymorphic and metamorphic worm exploits. In Proceedings of the 12th ACM Conference on Computer and Communications Security (CCS).
    [29]
    Nilesh Dalvi, Pedro Domingos, Mausam, Sumit Sanghai, and Deepak Verma. 2004. Adversarial classification. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 99--108.
    [30]
    Damballa. 2008. 3% to 5% of Enterprise Assets Are Compromised by Bot-Driven Targeted Attack Malware. Retrieved from http://www.prnewswire.com/news-releases/3-to-5-of-enterprise-assets-are-compromised-by-bot-driven-targeted-attack-malware-61634867.html.
    [31]
    Mohsen Damshenas, Ali Dehghantanha, and Ramlan Mahmoud. 2013. A survey on malware propagation, analysis, and detection. International Journal of Cyber-Security and Digital Forensics (IJCSDF) 2, 4 (2013), 10--29.
    [32]
    Sanjeev Das, Yang Liu, Wei Zhang, and Mahintham Chandramohan. 2016. Semantics-based online malware detection: Towards efficient real-time protection against malware. IEEE Transactions on Information Forensics and Security 11, 2 (2016), 289--302.
    [33]
    Thomas Dietterich. 1997. Machine learning research: Four current directions. Artificial Intelligence Magzine 18, 4 (1997), 97--36.
    [34]
    Thomas G. Dietterich. 2000. Ensemble methods in machine learning. In Proceedings of the 1st International Workshop on Multiple Classifier Systems.
    [35]
    Artem Dinaburg, Paul Royal, Monirul Sharif, and Wenke Lee. 2008. Ether: Malware analysis via hardware virtualization extensions. In Proceedings of the 15th ACM Conference on Computer and Communications Security (CCS).
    [36]
    Pedro Domingos and Michael Pazzani. 1997. On the optimality of simple Bayesian classifier under zero-one loss. Machine Learning 29, 2--3 (1997), 103--130.
    [37]
    Manuel Egele, Theodoor Scholte, Engin Kirda, and Christopher Kruegel. 2012. A survey on automated dynamic malware-analysis techniques and tools. ACM Computing Surveys (CSUR) 44, 2 (2012), 6.
    [38]
    Yuval Elovici, Asaf Shabtai, Robert Moskovitch, Gil Tahan, and Chanan Glezer. 2007. Applying machine learning techniques for detection of malicious code in network traffic. KI: Advances in Artificial Intelligence (2007).
    [39]
    EMarketer. 2014. Global B2C Ecommerce Sales to Hit $1.5 Trillion This Year Driven by Growth in Emerging Markets. Retrieved from http://www.emarketer.com/Article/Global-B2C-Ecommerce-Sales-Hit-15-Trillion-This-Year-Driven-by-Growth-Emerging-Markets/1010575.
    [40]
    Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Gomes Amorim. 2014. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research 15, 1 (2014), 3133--3181.
    [41]
    Eric Filiol, Gregoire Jacob, and Mickael Le Liard. 2007. Evaluation methodology and theoretical model for antiviral behavioural detection strategies. Journal in Computer Virology 3, 1 (2007), 23--37.
    [42]
    Ivan Firdausi, Alva Erwin, and Anto Satriyo Nugroho. 2010. Analysis of machine learning techniques used in behavior based malware detection. In Proceedings of 2nd International Conference on Advances in Computing, Control and Telecommunication Technologies (ACT).
    [43]
    Evelyn Fix and Joseph L. Hodges Jr. 1951. Discriminatory analysis-nonparametric discrimination: Consistency properties. US Air Force, School of Avaiation Medicine, Tech. Rep 4 (1951), 5--32.
    [44]
    Matt Fredrikson, Somesh Jha, Mihai Christodorescu, Reiner Sailer, and Xifeng Yan. 2010. Synthesizing near-optimal malware specifications from suspicious behaviors. In Proceedings of IEEE Symposium on Security and Privacy.
    [45]
    Yoav Freund and Robert E. Schapire. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comp. Syst. Sci. 55, 1 (1997), 119--39.
    [46]
    Ekta Gandotra, Divya Bansal, and Sanjeev Sofat. 2014. Malware analysis and classification: A survey. Journal of Information Security 5, 2 (2014), 56--64.
    [47]
    Maria Garnaeva, Victor Chebyshev, Denis Makrushin, Roman Unuchek, and Anton Ivanov. 2014. Kaspersky Security Bulletin 2014. Retrieved from http://securelist.com/analysis/kaspersky-security-bulletin/68010/kaspersky-security-bulletin-2014-overall-statistics-for-2014/.
    [48]
    Todd R. Golub, Donna K. Slonim, Pablo Tamayo, Christine Huard, Michelle Gaasenbeek, Jill P. Mesirov, and Hilary Coller. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 5439 (1999), 531--537.
    [49]
    Isabelle Guyon and Andr Elisseeff. 2003. An introduction to variable and feature selection. Jouranl of Machine Learning Research 3 (March 2003), 1157--1182.
    [50]
    Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter (2009).
    [51]
    William Hardy, Lingwei Chen, Shifu Hou, Yanfang Ye, and Xin Li. 2016. DL4MD: A deep learning framework for intelligent malware detection. In Proceedings of the International Conference on Data Mining (DMIN).
    [52]
    Olivier Henchiri and Nathalie Japkowicz. 2006a. A feature selection and evaluation scheme for computer virus detection. In Proceedings of the 6th International Conference on Data Mining.
    [53]
    Olivier Henchiri and Nathalie Japkowicz. 2006b. A feature selection and evaluation scheme for computer virus detection. In Proceedings of ICDM.
    [54]
    Shif Hou, Aaron Saas, Yanfang Ye, and Lifei Chen. 2016. DroidDelver: An android malware detection system using deep belief network based on API call blocks. In Proceedings of the International Conference on Web-Age Information Management. 54--66.
    [55]
    Xin Hu. 2011. Large-scale malware analysis, detection, and signature generation. Ph.D. Dissertation, Department of Computer Science and Engineering, University of Michigan.
    [56]
    Galen Hunt and Doug Brubacher. 1998. Detours: Binary interception of win32 functions. In Proceedings of the 3rd USENIX Windows NT Symposium.
    [57]
    IDAPro. 2016. The Interactive Disassembler. Retrieved from https://www.hex-rays.com/products/ida/support/download_freeware.shtml.
    [58]
    Nwokedi Idika and Aditya P. Mathur. 2007. A survey of malware detection techniques. Research Report in Purdue University (2007).
    [59]
    Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbor: Towards removing the curse of dimensionality. In Proceedings of 30th Annual ACM Symposium on Theory of Computing.
    [60]
    Virtualization Technology Intel. 2013. Retrieved from http://www.intel.com/technology/virtualization.
    [61]
    Rafiqul Islam, Ronghua Tian, Lynn M. Batten, and Steve Versteeg. 2013. Classification of malware based on integrated static and dynamic features. Journal of Network and Computer Application 36, 2 (2013), 646--656.
    [62]
    ITU. 2014. ITU releases 2014 ICT figures. Retrieved from https://www.itu.int/net/pressoffice/press_releases/2014/23.aspx.
    [63]
    Anil K. Jain, Robert P. W. Duin, and Jianchang Mao. 2000. Statistical pattern recognition: A review. IEEE Trans. Pattern Anal. Mach. Intell. 22, 1 (2000), 4--37.
    [64]
    Xuxian Jiang, Dongyan Xu, Helen Wang, and Eugene Spafford. 2005. Virtual playgrounds for worm behavior investigation. In Proceedings of the 8th International Symposium on Recent Advances in Intrusion Detection.
    [65]
    Thorsten Joachims. 1998. Making large-scale support vector machine learning practical. Advances in Kernel Methods: Support Vector Machines (1998).
    [66]
    George H. John and Pat Langley. 1995. Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.
    [67]
    Min Gyung Kang, Pongsin Poosankam, and Heng Yin. 2007. Renovo: A hidden code extractor for packed executables. In Proceedings of the 5th ACM Workshop on Recurring Malcode (WORM).
    [68]
    Chris Kanich, Christian Kreibich, Kirill Levchenko, Brandon Enright, Vern Paxson, Geoffrey M. Voelker, and Stefan Savage. 2008. Spamalytics: An empirical analysis of spam marketing conversion. In Proceedings of the 15th ACM Conference on Computer and Communications Security (CCS).
    [69]
    Nikos Karampatziakis, Jack W. Stokes, Anil Thomas, and Mady Marinescu. 2013. Using file relationships in malware classification. In Proceedings of the Conference on Detection of Intrusions and Malware and Vulnerability Assessment.
    [70]
    Md Enamul Karim, Andrew Walenstein, Arun Lakhotia, and Laxmi Parida. 2005. Malware phylogeny generation using permutations of code. Journal in Computer Virology 1, 1--2 (2005), 13--23.
    [71]
    Kaspersky. 2015. The Great Bank Robbery. Retrieved from http://www.kaspersky.com/about/news/virus/2015/Carbanak-cybergang-steals-1-bn-USDfrom-100-financial-institutions-worldwide.
    [72]
    Kris Kendall and Chad McMillan. 2007. Practical Malware Analysis. Retrieved from https://www.blackhat.com/presentations/bh-dc-07/Kendall_McMillan/Presentation/bh-dc-07-Kendall_McMillan.pdf.
    [73]
    Kingsoft. 2014. 2013-2014 Internet Security Report in China. Retrieved from http://www.ijinshan.com/news/2014011401.shtml.
    [74]
    Kingsoft. 2015. 2014-2015 Internet Security Research Report in China. Retrieved from http://www.cssn.cn/xwcbx/xwcbx_gcsy/201501/P020150122566733317860.pdf.
    [75]
    Kingsoft. 2016. 2015-2016 Internet Security Research Report in China. Retrieved from http://cn.cmcm.com/news/media/2016-01-14/60.html.
    [76]
    Clemens Kolbitsch, Paolo Milani Comparetti, Christopher Kruegel, Engin Kirda, Xiaoyong Zhou, and XiaoFengWang. 2009. Effective and efficient malware detection at the end host. In Proceedings of the 18th Conference on USENIX Security Symposium.
    [77]
    Jeremy Z. Kolter and Marcus A. Maloof. 2004. Learning to detect malicious executables in the wild. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (SIGKDD).
    [78]
    J. Zico Kolter and Marcus A. Maloof. 2006. Learning to detect and classify malicious executables in the wild. Journal of Machine Learning Research 7 (Dec. 2006), 2721--2744.
    [79]
    Nojun Kwak and Chong-Ho Choi. 2002. Input feature selection by mutual information based on parzen window. IEEE Trans. Pattern Anal. Mach. Intell. 24, 12 (2002), 1667--1671.
    [80]
    Pat Langley. 1994. Selection of relevant features in machine learning. In Proceedings of AAAI Fall Symposium.
    [81]
    Andrea Lanzi, Monirul Sharif, and Wenke Lee. 2009. K-Tracer: A system for extracting kernel malware behavior. In Proceedings of the 16th Annual Network and Distributed System Security Symposium (NDSS).
    [82]
    Tony Lee and Jigar J. Mody. 2006. Behavioral classification. In Proceedings of the European Institute for Computer Antivirus Research Conference (EICAR).
    [83]
    David D. Lewis and William A. Gale. 1994. A sequential algorithm for training text classifiers. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Springer-Verlag, New York, Inc., 3--12.
    [84]
    Shengqiao Li, E. James Harner, and Donald A. Adjeroh. 2011. Random KNN feature selection - A fast and stable alternative to random forests. BMC Bioinformatics 12, 1 (2011), 450.
    [85]
    Shengqiao Li, E. James Harner, and Donald A. Adjeroh. 2014. Random KNN. In Proceedings of the 2014 IEEE International Conference on Data Mining Workshops. 629--636.
    [86]
    Tao Li (Ed.). 2015. Event Mining: Algorithms and Applications. CRC Press.
    [87]
    LordPE. 2013. PE Tools - LordPE. Retrieved from http://www.malware-analyzer.com/pe-tools.
    [88]
    Mike Loukides and Andy Oram. 1996. Getting to know gdb. Linux Journal (1996).
    [89]
    James Lyne. 2014. Security threat trends 2015. Retrieved from https://www.sophos.com/threat-center/medialibrary/PDFs/other/sophos-trends-and-predictions-2015.pdf.
    [90]
    Mohammad M. Masud, Tahseen Al-Khateeb, Kevin W. Hamlen, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani M. Thuraisingham. 2011. Cloud-based malware detection for evolving data streams. ACM Trans. Management Inf. Syst. 2, 3 (2011), 16.
    [91]
    Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham. 2008. Mining concept-drifting data stream to detect peer to peer botnet traffic. Tech. rep. UTDCS-05-08, The University of Texas at Dallas, Richardson (2008).
    [92]
    Mohammad M. Masud, Latifur Khan, and Bhavani Thuraisingham. 2007. A scalable multi-level feature extraction technique to detect malicious executables. Information Systems Frontiers 10, 1 (2007), 33--45.
    [93]
    Kirti Mathur and Saroj Hiranwal. 2013. A survey on techniques in detection and analyzing malware executables. International Journal of Advanced Research in Computer Science and Software Engineering 3, 4 (2013), 422--428.
    [94]
    Micropoint. 2008. Micropoint Antivirus. Retrieved from http://www.micropoint.com.cn/Channel//20080626114608.html.
    [95]
    David Moore and Colleen Shannon. 2002. Code-red: A case study on the spread and victims of an internet worm. In Proceedings of the Internet Measurement Workshop.
    [96]
    Andreas Moser, Christopher Kruegel, and Engin Kirda. 2007. Limits of static analysis for malware detection. In Proceedings of the 23rd Annual Computer Security Applications Conference (ACSAC).
    [97]
    Robert Moskovitch, Clint Feher, and Yuval Elovici. 2009. A chronological evaluation of unknown malcode detection. LNCS: Intelligence and Security Informatics 5477 (2009), 112--117.
    [98]
    Robert Moskovitch, Clint Feher, Nir Tzachar, Eugene Berger, Marina Gitelman, Shlomi Dolev, and Yuval Elovici. 2008a. Unknown malcode detection using OPCODE representation. In Proceedings of the European Conference on Intelligence and Security Informatics (EuroISI).
    [99]
    Robert Moskovitch, Nir Nissim, and Yuval Elovici. 2008b. Acquisition of malicious code using active learning. In PinKDD.
    [100]
    Robert Moskovitch, Dima Stopel, Clint Feher, Nir Nissim, and Yuval Elovici. 2008c. Unknown malcode detection via text categorization and the imbalance problem. In IEEE Intelligence and Security Informatics.
    [101]
    Kevin P. Murphy. 2012. Machine learning: A probabilistic perspective. In The MIT Press, Cambridge, Massachusetts.
    [102]
    Ion Muslea, Steven Minton, and Craig A. Knoblock. 2006. Active learning with multiple views. Journal of Artificial Intelligence Research 27 (2006), 203--233.
    [103]
    Carey Nachenberg and Vijay Seshadri. 2010. An analysis of real-world effectiveness of reputation-based security. In Proceedings of the Virus Bulletin Conference (VB).
    [104]
    Nicholas Nethercote and Julian Seward. 2007. Valgrind: A framework for heavyweight dynamic binary instrumentation. In Proceedings of ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation.
    [105]
    Hieu T. Nguyen and Arnold Smeulders. 2004. Active learning using pre-clustering. In Proceedings of the 21st International Conference on Machine Learning. ACM, 79.
    [106]
    Ming Ni, Tao Li, Qianmu Li, Hong Zhang, and Yanfang Ye. 2016. FindMal: A file-to-file social network based malware detection framework. Knowledge-Based Systems 112 (2016), 142--151.
    [107]
    Corporation of Compuware. 1999. Debugging blue screens. Technical Paper (September 1999).
    [108]
    Gunter Ollmann. 2010. Serial variant evasion tactics techniques used to automatically bypass antivirus technologies. Retrieved from http://www.damballa.com/downloads/rpubs/WPSerialVariantEvasionTactics.pdf.
    [109]
    OllyDump. 2006. PE Tools - OllyDump. Retrieved from http://www.openrce.org/downloads/details/108/OllyDump.
    [110]
    David Orenstein. 2000. Application programming interface (API). In Quick Study: Application Programming Interface (API).
    [111]
    Nikunj C. Oza and Stuart Russell. 2001. Experimental comparisons of online and batch versions of bagging and boosting. In Proceedings of SIGKDD.
    [112]
    Judea Pearl. 1987. Evidential reasoning using stochastic simulation of causal models. Artificial Intelligence 32, 2 (1987), 245--258.
    [113]
    Hanchuan Peng, Fuhui Long, and Chris Ding. 2005. Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 8 (2005), 1226--1238.
    [114]
    Simon Perkins, Kevin Lacker, and James Theiler. 2003. Grafting: Fast incremental feature selection by gradient descent in function space. JMLR 3 (March 2003), 1333--1356.
    [115]
    Qemu. 2016. (2016). http://www.qemu-project.org/index.html.
    [116]
    Internet Security Center Qihoo. 2015. 2014 Internet Security Research Report in China. Retrieved from http://zt.360.cn/report/.
    [117]
    J. Ross Quinlan. 1986. Induction of decision trees. Machine Learning 1, 1 (1986), 81--106.
    [118]
    J. Ross Quinlan. 1993. C4.5: Programs for Machine Learning. San Francisco, CA: Morgan Kaufmann Publishers, Inc. (1993).
    [119]
    Alain Rakotomamonjy. 2003. Variable selection using SVM-based criteria. JMLR 3 (March 2003), 1357--1370.
    [120]
    Zulfikar Ramzan, Vijay Seshadri, and Carey Nachenberg. 2013. Reputation-based security: An analysis of real world effectiveness. In Symantec Security Response.
    [121]
    Rizwan Rehmani, G. C. Hazarika, and Gunadeep Chetia. 2011. Malware threats and mitigation strategies: A Survey. Journal of Theoretical and Applied Information Technology 29, 2 (2011), 69--73.
    [122]
    John Robbins. 1999. Debugging windows based applications using windbg. Microsoft Systems Journal (1999).
    [123]
    Lior Rokach. 2010. Ensemble-based classifiers. Artif Intell Rev 33, 1 (2010), 1--39.
    [124]
    Paul Royal, Mitch Halpin, David Dagon, Robert Edmonds, and Wenke Lee. 2006. PolyUnpack: Automating the hidden-code extraction of unpack-executing malware. In Proceedings of the 22nd Annual Computer Security Applications Conference.
    [125]
    Yvan Saeys, Inaki Inza, and Pedro Larranaga. 2007. A review of feature selection techniques in bioinformatics. Bioinformatics 23, 19 (2007), 2507--2517.
    [126]
    Gerard Salton, Anita Wong, and Chung-Shu Yang. 1975. A vector space model for automatic indexing. Commun. ACM 18, 11 (1975), 613--620.
    [127]
    Igor Santos, Jaime Devesa, Felix Brezo, Javier Nieves, and Pablo Garcia Bringas. 2013. OPEM: A static-dynamic approach for machine learning based malware detection. In Proceedings of International Conference CISIS-ICEUTE, Special Sessions Advances in Intelligent Systems and Computing.
    [128]
    Igor Santos, Carlos Laorden, and Pablo G. Bringas. 2011a. Collective classification for unknown malware detection. In Proceedings of the International Conference on Security and Cryptography.
    [129]
    Igor Santos, Javier Nieves, and Pablo G. Bringas. 2011b. Semi-supervised learning for unknown malware detection. In International Symposium on Distributed Computing and Artificial Intelligence Advances in Intelligent and Soft Computing.
    [130]
    Joshua Saxe and Konstantin Berlin. 2015. Deep neural network based malware detection using two dimensional binary program features. In Proceedings of the 10th International Conference on Malicious and Unwanted Software (MALWARE).
    [131]
    Matthew G. Schultz, Eleazar Eskin, F. Zadok, and Salvatore J. Stolfo. 2001. Data mining methods for detection of new malicious executables. In Proc. of the IEEE Symposium on Security and Privacy.
    [132]
    Fabrizio Sebastiani. 2002. Text categorization. Comput. Surveys 34, 1 (2002), 1--47.
    [133]
    H. Sebastian Seung, Manfred Opper, and Haim Sompolinsky. 1992. Query by committee. In Proceedings of the 5th Annual Workshop on Computational Learning Theory. ACM, 287--294.
    [134]
    Asaf Shabtai, Robert Moskovitch, Yuval Elovici, and Chanan Glezer. 2009. Detection of malicious code by applying machine learning classifiers on static features: A state-of-the-art survey. Information Security Technical Report 14, 1 (2009), 16--29.
    [135]
    Muazzam Siddiqui, Morgan C. Wang, and Joohan Lee. 2008. A survey of data mining techniques for malware detection using file features. In Proceedings of ACM-SE.
    [136]
    Muazzam Siddiqui, Morgan C. Wang, and Joohan Lee. 2009. Detecting internet worms using data mining techniques. Journal of Systemics, Cybernetics and Informatics 6, 6 (2009), 48--53.
    [137]
    Dawn Song, David Brumley, Heng Yin, Juan Caballero, Ivan Jager, Min Gyung Kang, Zhenkai Liang, James Newsome, Pongsin Poosankam, and Prateek Saxena. 2008. BitBlaze: A new approach to computer security via binary analysis. In Proceedings of the 4th International Conference on Information Systems Security.
    [138]
    Eugene H. Spafford. 1989. The internet worm incident. In Proceedings of the 2nd European Software Engineering Conference.
    [139]
    Elizabeth Stinson and John C. Mitchell. 2007. Characterizing bots’ remote control behavior. LNCS: Detection of Intrusions and Malware, and Vulnerability Assessment 4579 (2007), 89--108.
    [140]
    Jack W. Stokes, John C. Platt, Helen J. Wang, Joe Faulhaber, Jonathan Keller, Mady Marinescu, Anil Thomas, and Marius Gheorghescu. 2012. Scalable telemetry classification for automated malware detection. Computer Security - ESORICS (2012).
    [141]
    Andrew H. Sung, Jianyun Xu, Patrick Chavez, and Srinivas Mukkamala. 2004. Static analyzer of vicious executables (SAVE). In Proceedings of the 20th Annual Computer Security Applications Conference.
    [142]
    Symantec. 2008. Symantec global internet security threat report. Retrieved from http://eval.symantec.com/mktginfo/enterprise/white_papers/b-whitepaper_internet_security_threat_report_xiii_04-2008.en-us.pdf.
    [143]
    Symantec. 2014a. Internet Security Threat Report 2014. Retrieved from http://www.symantec.com/security_response/publications/threatreport.jsp.
    [144]
    Symantec. 2014b. The Threat Landscape in 2014 and Beyond: Symantec and Norton Predictions for 2015, Asia Pacific and Japan. Retrieved from http://www.symantec.com/connect/blogs/threat-landscape-2014-and-beyond-symantec-and-norton-predictions-2015-asia-pacific-japan.
    [145]
    Symantec. 2016. Internet Security Threat Report. Retrieved from https://www.symantec.com/content/dam/symantec/docs/reports/istr-21-2016-en.pdf.
    [146]
    Kimberly Tam, Ali Feizollah, Nor Badrul Anuar, Rosli Salleh, and Lorenzo Cavallaro. 2017. The evolution of android malware and android analysis techniques. ACM Computing Surveys (CSUR) 49, 4 (2017), 76.
    [147]
    Acar Tamersoy, Kevin Roundy, and Duen Horng Chau. 2014. Guilt by association: Large scale malware detection by mining file-relation graphs. In Proccedings of ACM International Conference on Knowledge Discovery and Data Mining (ACM SIGKDD).
    [148]
    Fadi Abdeljaber Thabtah. 2007. A review of associative classification mining. Knowledge Engineering Review 22, 1 (2007), 37--65.
    [149]
    Ronghua Tian, Rafiqul Islam, Lynn Batten, and Steve Versteeg. 2010. Differentiating malware from cleanwares using behavioral analysis. In Proceedings of 5th International Conference on Malicious and Unwanted Software (Malware).
    [150]
    TrendLabs. 2014. The invisible becomes visible: Trend micro security predictions for 2015 and beyond. (2014). http://www.trendmicro.com/cloud-content/us/pdfs/security-intelligence/reports/rpt-the-invisible-becomes-visible.pdf.
    [151]
    Trend Threat Research Team TrendMicro. 2010. Zeus: A Persistent Criminal Enterprise. Retrieved from http://www.trendmicro.com/cloud-content/us/pdfs/security-intelligence/white-papers/wp_zeuspersistent-criminal-enterprise.pdf.
    [152]
    Amit Vasudevan and Ramesh Yerraballi. 2005. Stealth breakpoints. In Proceedings of the 21st Annual Computer Security Applications Conference.
    [153]
    Amit Vasudevan and Ramesh Yerraballi. 2006. Cobra: Fine-grained malware analysis using stealth localized-executions. In Proceedings of 2006 IEEE Symposium on Security and Privacy.
    [154]
    Shobha Venkataraman, Avrim Blum, and Dawn Song. 2008. Limits of learning-based signature generation with adversaries. In NDSS.
    [155]
    Andrei Venzhega, Polina Zhinalieva, and Nikolay Suboch. 2013. Graph-based malware distributors detection. In Proceedings of the 22nd International Conference on World Wide Web Companion (WWW).
    [156]
    Randall Wald, Taghi M. Khoshgoftaar, and Amri Napolitano. 2013. Comparison of stability for different families of filter-based and wrapper-based feature selection. In ICMLA.
    [157]
    Tzu-Yen Wang, Shi-Jinn Horng, Ming-Yang Su, Chin-Hsiung Wu, Peng-Chu Wang, and Wei-Zen Su. 2006b. A surveillance spyware detection system based on data mining methods. Evolutionary Computation (2006), 3236--3241.
    [158]
    Yi-Min Wang, Doug Beck, Xuxian Jiang, and Roussi Roussev. 2006a. Automated web patrol with strider honeymonkeys: Finding web sites that exploit browser vulnerabilities. In NDSS.
    [159]
    Websense Security Labs. 2014. 2015 Security Predictions. Retrieved from http://www.websense.com/assets/reports/report-2015-security-predictions-en.pdf.
    [160]
    Paul Werbos. 1974. Beyond regression: New tools for prediction and analysis in the behavioral science. Ph.D. Dissertation, Harvard University.
    [161]
    Wikipedia. 2016. Scareware. Retrieved from https://en.wikipedia.org/wiki/Scareware.
    [162]
    Wikipedia. 2017a. Assembly Language. Retrieved from http://en.wikipedia.org/wiki/Assembly_language.
    [163]
    Wikipedia. 2017b. Computer Virus. Retrieved from http://en.wikipedia.org/wiki/Computer_virus.
    [164]
    Wikipedia. 2017c. Morris Worm. Retrieved from http://en.wikipedia.org/wiki/Morris_worm.
    [165]
    Wikipedia. 2017d. Ransomware. Retrieved from https://en.wikipedia.org/wiki/Ransomware.
    [166]
    Wikipedia. 2017e. Rootkit. Retrieved from http://en.wikipedia.org/wiki/Rootkit.
    [167]
    Wikipedia. 2017f. Zero-day (computing). Retrieved from https://en.wikipedia.org/wiki/Zero-day_(computing).
    [168]
    Wikipedia. 2017g. Zeus (malware). Retrieved from http://en.wikipedia.org/wiki/Zeus_(malware).
    [169]
    Carsten Willems, Thorsten Holz, and Felix Freiling. 2007. Toward automated dynamic malware analysis using cwsandbox. In IEEE Security and Privacy.
    [170]
    Rui Xu and Donald Wunsch. 2005. Survey of clustering algorithms. IEEE Transactions on Neural Networks 16, 3 (2005), 645--678.
    [171]
    Yanfang Ye. 2010. Research on intelligent malware detection methods and their applications. Ph.D. Dissertation, Department of Computer Science, Xiamen University (2010).
    [172]
    Yanfang Ye, Lifei Chen, Dingding Wang, Tao Li, Qingshan Jiang, and Min Zhao. 2009. SBMDS: An interpretable string based malware detection system using SVM ensemble with bagging. Journal in Computer Virology 5, 4 (2009), 283--293.
    [173]
    Yanfang Ye, Tao Li, Yong Chen, and Qingshan Jiang. 2010. Automatic malware categorization using cluster ensemble. In Proceedings of ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD).
    [174]
    Yanfang Ye, Tao Li, Kai Huang, Qingshan Jiang, and Yong Chen. 2009a. Hierarchical associative classifier (HAC) for malware detection from the large and imbalanced gray list. Journal of Intelligent Information Systems 35, 1 (2009), 1--20.
    [175]
    Yanfang Ye, Tao Li, Qingshan Jiang, Zhixue Han, and Li Wan. 2009c. Intelligent file scoring system for malware detection from the gray list. In Proceedings of ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD).
    [176]
    Yanfang Ye, Tao Li, Qingshan Jiang, and Youyu Wang. 2009b. CIMDS: Adapting post-processing techniques of associative classification for malware detection system. IEEE Transactions on Systems, Man, and Cybernetics 40, 3 (2009), 298--307.
    [177]
    Yanfang Ye, Tao Li, Shenghuo Zhu, Weiwei Zhuang, Egemen Tas, Umesh Gupta, and Melih Abdulhayoglu. 2011. Combining file content and file relations for cloud based malware detection. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD).
    [178]
    Yanfang Ye, Dingding Wang, Tao Li, and Dongyi Ye. 2007. IMDS: Intelligent malware detection system. In Proceedings of ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD).
    [179]
    Yanfang Ye, Dingding Wang, Tao Li, Dongyi Ye, and Qingshan Jiang. 2008. An intelligent PE-malware detection system based on association mining. Journal in Computer Virology 4, 4 (2008), 323--334.
    [180]
    Jonathan S. Yedidia, William T. Freeman, and Yair Weiss. 2001. Understanding belief propagation and its generalizations. In Mitsubishi Electric Research Laboratories.
    [181]
    Heng Yin, Dawn Song, Manuel Egele, Christopher Kruegel, and Engin Kirda. 2007. Panorama: Capturing system-wide information flow for malware detection and analysis. In Proceedings of the 14th ACM Conference on Computer and Communications Security (CCS).
    [182]
    Chunqiu Zeng, Liang Tang, Wubai Zhou, Tao Li, Larisa Shwartz, and Genady Ya. Grabarnik. 2017. An integrated framework for mining temporal logs from fluctuating events. IEEE Transactions on Services Computing (TSC) (2017). In Press.
    [183]
    Boyun Zhang, Jianping Yin, Jingbo Hao, Dingxing Zhang, and Shulin Wang. 2007. Malicious codes detection based on ensemble learning. Autonomic and Trusted Computing (2007).
    [184]
    Jianwei Zhuge, Thorsten Holz, Chengyu Song, Jinpeng Guo, Xinhui Han, and Wei Zou. 2008. Studying malicious websites and the underground economy on the Chinese web. In Proceedings of the 7th Workshop on Economics of Information Security.

        Reviews

        Klerisson Paixao

        It is not new that software is eating the world [1]. Industries and businesses everywhere are being "softwareized." Meanwhile, we cannot deny that malware (malicious software) is also having a feast. This paper provides a comprehensive survey of existing malware detection technology, focusing on data mining techniques. It starts with a taxonomy based primarily on common types of malware: viruses, worms, Trojans, spyware, ransomware, scareware, bots, rootkits, and hybrid malware. Then, the paper describes the current state of the (anti-)malware industry.

        The study is a bit short on the data mining techniques used: the authors restrict themselves to describing detection methods that rely on classification and clustering algorithms. On the other hand, it does a very good job of summarizing dozens of methods used in the literature. Further, the authors suggest new ideas for future research directions. Notably, they discuss the application of active learning to the task. Such a technique seems well suited to a critical problem in the field: data scarcity. While cybercriminals usually cooperate and collaborate to build their malware, their counterparts keep collections of cybercrime data under lock and key.

        The paper ends with a clear conclusion: there is no silver bullet when it comes to malware detection. All classification/clustering techniques have their pros and cons; thus, they will not always perform optimally. This survey serves well as a starting point and initial set of guidelines for people willing to do research in this field.

        Online Computing Reviews Service

        Published In

        ACM Computing Surveys, Volume 50, Issue 3
        May 2018
        550 pages
        ISSN: 0360-0300
        EISSN: 1557-7341
        DOI: 10.1145/3101309
        Editor: Sartaj Sahni
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 29 June 2017
        Accepted: 01 March 2017
        Revised: 01 November 2016
        Received: 01 August 2015
        Published in CSUR Volume 50, Issue 3

        Author Tags

        1. Survey
        2. data mining
        3. malware detection

        Qualifiers

        • Survey
        • Research
        • Refereed

        Cited By

        • (2024) A Hybrid Rider Optimization with Deep Learning Driven Intrusion Detection Farmwork in Wireless Sensor Network. Salud, Ciencia y Tecnología - Serie de Conferencias 3 (762). DOI: 10.56294/sctconf2024762. Online publication date: 13-May-2024
        • (2024) Revolutionizing Malware Detection. Innovations, Securities, and Case Studies Across Healthcare, Business, and Technology (196-220). DOI: 10.4018/979-8-3693-1906-2.ch011. Online publication date: 12-Apr-2024
        • (2024) CSMC: A Secure and Efficient Visualized Malware Classification Method Inspired by Compressed Sensing. Sensors 24:13 (4253). DOI: 10.3390/s24134253. Online publication date: 30-Jun-2024
        • (2024) A Packet Content-Oriented Remote Code Execution Attack Payload Detection Model. Future Internet 16:7 (235). DOI: 10.3390/fi16070235. Online publication date: 2-Jul-2024
        • (2024) MalOSDF: An Opcode Slice-Based Malware Detection Framework Using Active and Ensemble Learning. Electronics 13:2 (359). DOI: 10.3390/electronics13020359. Online publication date: 15-Jan-2024
        • (2024) An effective deep learning scheme for android malware detection leveraging performance metrics and computational resources. Intelligent Decision Technologies 18:1 (33-55). DOI: 10.3233/IDT-230284. Online publication date: 1-Jan-2024
        • (2024) Intelligent Anomaly Detection System Based on Ensemble and Deep Learning. 2024 26th International Conference on Advanced Communications Technology (ICACT) (137-142). DOI: 10.23919/ICACT60172.2024.10471923. Online publication date: 4-Feb-2024
        • (2024) An Evaluation of Current Malware Trends and Defense Techniques: A Scoping Review with Empirical Case Studies. Journal of Advances in Information Technology (649-671). DOI: 10.12720/jait.15.5.649-671. Online publication date: 2024
        • (2024) MalwareTotal: Multi-Faceted and Sequence-Aware Bypass Tactics against Static Malware Detection. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (1-12). DOI: 10.1145/3597503.3639141. Online publication date: 20-May-2024
        • (2024) API2Vec++: Boosting API Sequence Representation for Malware Detection and Classification. IEEE Transactions on Software Engineering 50:8 (2142-2162). DOI: 10.1109/TSE.2024.3422990. Online publication date: Aug-2024
