Abstract
Generic computer virus detection is the need of the hour, as most commercial antivirus software fails to detect new and unknown viruses. Motivated by the success of data mining and machine learning techniques in intrusion detection systems, recent research on detecting malicious executables is directed towards devising efficient non-signature-based techniques that can profile program characteristics from a set of training examples. Byte sequences and byte n-grams are considered to be the basis of feature extraction. Because the number of n-grams is very large, several methods of feature selection have been proposed in the literature. A recent report on the use of information-gain-based feature selection has yielded the best-known result in distinguishing malicious executables from benign ones. We observe that information gain models the presence of an n-gram in one class and its absence in the other. Through a simple example, we show that this may lead to erroneous results. In this paper, we describe a new feature selection measure, the class-wise document frequency of byte n-grams. We empirically demonstrate that the proposed measure is better for feature selection. For detection, instead of relying on any single classifier, we combine several classifiers using Dempster–Shafer theory for better classification accuracy. Our experimental results show that such a scheme detects virus programs far more efficiently than the earlier known methods.
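The two ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define the measure or the combination scheme in detail, so the sketch assumes that class-wise document frequency means the number of files in a class that contain a given n-gram, and that features are chosen as the union of the top k n-grams per class. The function names, the value of k, the tie-breaking rule, and the mass values in the usage lines are all hypothetical. The combination step is the standard Dempster rule over the frame {malicious, benign}; how classifier outputs are turned into mass functions is left open.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, List, Set


def byte_ngrams(data: bytes, n: int) -> Set[bytes]:
    """Return the set of distinct byte n-grams that occur in one file."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def classwise_document_frequency(
    files_by_class: Dict[str, Iterable[bytes]], n: int
) -> Dict[str, Counter]:
    """For each class, count how many files contain each n-gram (assumed definition)."""
    df: Dict[str, Counter] = {}
    for label, files in files_by_class.items():
        counts: Counter = Counter()
        for data in files:
            counts.update(byte_ngrams(data, n))  # a file contributes at most 1 per n-gram
        df[label] = counts
    return df


def select_features(df: Dict[str, Counter], k: int) -> List[bytes]:
    """Union of the k most frequent n-grams of each class (assumed selection rule)."""
    selected: Set[bytes] = set()
    for counts in df.values():
        # Sort by descending frequency, then by byte value, so ties are deterministic.
        ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        selected.update(gram for gram, _ in ranked[:k])
    return sorted(selected)


Mass = Dict[FrozenSet[str], float]


def dempster_combine(m1: Mass, m2: Mass) -> Mass:
    """Combine two mass functions with Dempster's rule of combination."""
    combined: Mass = {}
    conflict = 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            common = a & b
            if common:
                combined[common] = combined.get(common, 0.0) + wa * wb
            else:
                conflict += wa * wb  # mass assigned to contradictory evidence
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; rule is undefined")
    return {s: w / (1.0 - conflict) for s, w in combined.items()}


# Hypothetical toy data: two tiny "files" per class, 2-byte n-grams.
corpus = {
    "malicious": [b"\x90\x90\xeb\xfe", b"\x90\x90\xcc"],
    "benign": [b"\x55\x8b\xec", b"\x55\x8b\xec\x90"],
}
df = classwise_document_frequency(corpus, n=2)
features = select_features(df, k=1)

# Hypothetical masses from two classifiers; THETA is the "don't know" set.
M, B = frozenset({"malicious"}), frozenset({"benign"})
THETA = M | B
fused = dempster_combine({M: 0.6, B: 0.1, THETA: 0.3}, {M: 0.5, B: 0.2, THETA: 0.3})
```

In this toy run both classifiers lean towards "malicious", so the fused mass on that hypothesis is larger than either input mass, which is the behaviour one would want from combining agreeing evidence.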
Cite this article
Reddy, D.K.S., Pujari, A.K. N-gram analysis for computer virus detection. J Comput Virol 2, 231–239 (2006). https://doi.org/10.1007/s11416-006-0027-8
DOI: https://doi.org/10.1007/s11416-006-0027-8