Malware Detection Using Statistical Analysis of Byte-Level File Content
ABSTRACT
Commercial anti-virus software is unable to provide protection against newly launched (a.k.a. “zero-day”) malware. In this paper, we propose a novel malware detection technique which is based on the analysis of byte-level file content. The novelty of our approach, compared with existing content-based mining schemes, is that it does not memorize specific byte-sequences or strings appearing in the actual file content. Our technique is non-signature based and therefore has the potential to detect previously unknown and zero-day malware. We compute a wide range of statistical and information-theoretic features in a block-wise manner to quantify the byte-level file content. We leverage standard data mining algorithms to classify the file content of every block as normal or potentially malicious. Finally, we correlate the block-wise classification results of a given file to categorize it as benign or malware. Since the proposed scheme operates on the byte-level file content, it does not require any a priori information about the filetype. We have tested our proposed technique using a benign dataset comprising six different filetypes (DOC, EXE, JPG, MP3, PDF and ZIP) and a malware dataset comprising six different malware types (backdoor, trojan, virus, worm, constructor and miscellaneous). We also perform a comparison with existing data mining based malware detection techniques. The results of our experiments show that the proposed non-signature based technique surpasses the existing techniques and achieves more than 90% detection accuracy.

Categories and Subject Descriptors
D.4.6 [Security and Protection]: Invasive Software

General Terms
Experimentation, Security

Keywords
Computer Malware, Data Mining, Forensics

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CSI-KDD’09, June 28, 2009, Paris, France.
Copyright 2009 ACM 978-1-60558-669-4 ...$5.00.

1. INTRODUCTION
Sophisticated malware is becoming a major threat to the usability, security and privacy of computer systems and networks worldwide [1], [2]. A wide range of host-based solutions have been proposed by researchers and a number of commercial anti-virus (AV) software products are also available in the market [5]–[21]. These techniques can broadly be classified into two types: (1) static, and (2) dynamic. Static techniques mostly operate on machine-level code and disassembled instructions. In comparison, dynamic techniques mostly monitor the behavior of a program with the help of an API call sequence generated at run-time. The application of dynamic techniques in AV products is of limited use because of the large processing overheads incurred during run-time monitoring of API calls; as a result, the performance of computer systems significantly degrades. In comparison, the processing overhead is not a serious concern for static techniques because the scanning activity can be scheduled offline during idle time. Moreover, static techniques can also be deployed as an in-cloud network service that moves complexity from an end-point to the network cloud [28].

Almost all static malware detection techniques, including commercial AV software (whether signature-, heuristic- or anomaly-based), use specific content signatures such as byte sequences and strings. A major problem with content signatures is that they can easily be defeated by packing and basic code obfuscation techniques [3]. In fact, the majority of malware that appears today is a simple repacked version of old malware [4]. As a result, it effectively evades the content signatures of old malware stored in the database of commercial AV products. To conclude, existing commercial AV products cannot even detect a simple repacked version of previously detected malware.

The security community has expended significant effort in the application of data mining techniques to discover patterns in the malware content which are not easily evaded by code obfuscation techniques. The two most well-known data mining based malware detection techniques are ‘strings’ (proposed by Schultz et al. [7]) and ‘KM’ (proposed by Kolter et al. [8]). We take these techniques as a benchmark for the comparative study of our proposed scheme.

The novelty of our proposed technique, in contrast to the existing data mining based techniques, is its purely non-signature paradigm: it does not remember exact file contents for malware detection. It is a static malware detection technique which should be, intuitively speaking, robust to the most commonly used evasion techniques. The proposed technique computes a diverse set of statistical
and information-theoretic features in a block-wise manner on the byte-level file content. The generated feature vector of every block is then given as an input to standard data mining algorithms (J48 decision trees) which classify the block as normal (n) or potentially malicious (pm). Finally, the classification results of all blocks are correlated to categorize the given file as benign (B) or malware (M). If a file is split into k equally-sized blocks ($b_1, b_2, b_3, \cdots, b_k$) and n statistical features are computed for every k-th block ($f_{k,1}, f_{k,2}, f_{k,3}, \cdots, f_{k,n}$), then mathematically our scheme can be represented as:

$$
\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_k \end{pmatrix}
\overset{F}{\Rightarrow}
\begin{pmatrix}
f_{1,1}, f_{1,2} \cdots f_{1,n} \\
f_{2,1}, f_{2,2} \cdots f_{2,n} \\
\vdots \\
f_{k,1}, f_{k,2} \cdots f_{k,n}
\end{pmatrix}
\overset{D}{\Rightarrow}
\begin{pmatrix} n/pm \\ n/pm \\ \vdots \\ n/pm \end{pmatrix}
\overset{C}{\Rightarrow} B/M,
$$

where (F) is a suitable feature set and (D) is a data mining algorithm for classification of individual blocks. The file is eventually categorized as benign (B) or malware (M) by the correlation module (C). Once a suitable feature set (F) and a data mining algorithm (D) are selected, we test the accuracy of the solution using a benign dataset consisting of six filetypes: DOC, EXE, JPG, MP3, PDF and ZIP; and a malware dataset comprising six different malware types: backdoor, trojan, virus, worm, constructor and miscellaneous. The results of our experiments show that our scheme is able to provide more than 90% detection accuracy¹ for detecting malware, which is an encouraging outcome. To the best of our knowledge, this is the first pure non-signature based data mining malware detection technique using statistical analysis of the static byte-level file contents.

The rest of the paper is organized as follows. In Section 2 we provide a brief overview of related work in the domain of data mining malware detection techniques. We describe the detailed architecture of our malware detection technique in Section 3. We discuss the dataset in Section 4. We report the results of pilot studies in Section 5. In Section 6, we discuss the knowledge discovery process of the proposed technique by visualizing the learning models of data mining algorithms used for classification. Finally, we conclude the paper with an outlook to our future work.

on a dataset that consists of 1,001 benign and 3,265 malicious executables. These executables have 206 benign and 38 malicious samples in the portable executable (PE) file format. They have collected most of the benign executables from Windows 98 systems. They use three different approaches to statically extract features from executables.

The first approach extracts DLL information inside PE executables. Further, the DLL information is extracted using three types of feature vectors: (1) the list of DLLs (30 boolean values), (2) the list of DLL function calls (2,229 boolean values), and (3) the number of different function calls within each DLL (30 integer values). RIPPER, an inductive rule-learning algorithm, is used on top of every feature vector for classification. These schemes based on DLL information provide overall detection accuracies of 83.62%, 88.36% and 89.07% respectively. Sufficient details about the DLLs are not provided, so we could not implement this scheme in our study.

The second feature extraction approach extracts strings from the executables using the GNU strings program. A Naïve Bayes classifier is used on top of the extracted strings for malware detection. This scheme provides an overall detection accuracy of 97.11%. This scheme is reported to give the best results amongst all, and we have implemented it for our comparative study.

The third feature extraction approach uses byte sequences (n-grams) obtained using hexdump. The authors do not explicitly specify the value of n used in their study. However, from an example provided in the paper, we deduce it to be 2 (bi-grams). The Multi-Naïve Bayes algorithm is used for classification. This algorithm uses voting by a collection of individual Naïve Bayes instances. This scheme provides an overall detection accuracy of 96.88%.

The results of their experiments reveal that the Naïve Bayes algorithm with strings is the most effective approach for detecting unseen malicious executables with reasonable processing overheads. The authors acknowledge the fact that the string features are not robust and can be easily defeated. Multi-Naïve Bayes with byte sequences also provides a relatively high detection accuracy; however, it has large processing and memory requirements. The byte sequence technique was later improved by Kolter et al. and is explained below.
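The byte-sequence (n-gram) features used in the related work can be illustrated with a minimal sketch. It counts overlapping byte n-grams directly from raw bytes, with n = 2 by default to match the bi-gram setting deduced above. The function name `byte_ngrams` and the sample bytes are hypothetical and chosen only for illustration; this is not the implementation used by Schultz et al., who worked from hexdump output.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count overlapping byte n-grams in `data` (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of n bytes over the data, one byte at a time.
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


# Illustrative input only: six arbitrary bytes.
sample = bytes.fromhex("4d5a90004d5a")
counts = byte_ngrams(sample, n=2)
most_common_bigram, frequency = counts.most_common(1)[0]
print(most_common_bigram.hex(), frequency)
```

A classifier would then consume these counts, or the presence or absence of selected n-grams, as its feature vector.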
3.1 Block Generator Module (B)
The block generator module divides the byte-level contents of a given file into fixed-sized chunks, known as blocks. We have used blocks for reducing the processing overhead of the module. In future, we want to analyze the benefit of using variable-sized blocks as well. Remember that using a suitable block size plays a critical role in defining the accuracy of our framework because it puts a lower limit on the minimum size of malware that our framework can detect. We have to make a trade-off between the amount of available information per block and the accuracy

where we have used λ = 3 as suggested in [31].

3.2.4 Manhattan Distance
It is a special case of Minkowski distance [31] with λ = 1.

$$ mh_i = \sum_{j=0}^{n} | X_j - X_{j+1} | \qquad (4) $$

2 In the rest of the paper, we use the generic term n-grams once we want to refer to all 4-grams separately.
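The block generation step and the Manhattan distance of Equation (4) can be sketched together as follows. Several details are assumptions, because this excerpt does not state them: the block size of 1024 bytes is hypothetical, $X_j$ is taken to be the normalized frequency of the j-th byte value (1-gram) in a block, and the sum runs over all adjacent pairs that actually exist in the histogram.

```python
from typing import Iterator, List

BLOCK_SIZE = 1024  # hypothetical value; the excerpt does not state the block size


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> Iterator[bytes]:
    """Yield fixed-size blocks; the final block may be shorter."""
    for start in range(0, len(data), block_size):
        yield data[start:start + block_size]


def byte_histogram(block: bytes) -> List[float]:
    """Normalized 1-gram frequencies (assumed meaning of X_j)."""
    counts = [0] * 256
    for value in block:
        counts[value] += 1
    total = len(block)
    return [c / total for c in counts] if total else [0.0] * 256


def manhattan_adjacent(x: List[float]) -> float:
    """Sum of |X_j - X_{j+1}| over adjacent histogram entries, after Equation (4)."""
    return sum(abs(a - b) for a, b in zip(x, x[1:]))


# Illustrative input: every byte value appears equally often in each block.
data = bytes(range(256)) * 8
blocks = list(split_into_blocks(data))
mh = [manhattan_adjacent(byte_histogram(block)) for block in blocks]
print(len(blocks), mh)
```

A perfectly uniform block gives a distance of zero, while a block dominated by a few byte values gives a larger one, which is the kind of per-block contrast a classifier can use.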
Figure 1: Architecture of our proposed technique
where $t(r_i)$ is the frequency of n-grams in a given block.

$$ AS_i = \frac{\sum_{j=0}^{n} X_j \cdot X_{j+1}}{\left( \sum_{j=0}^{n} X_j^2 \cdot \sum_{j=0}^{n} X_{j+1}^2 \right)^{1/2}} \qquad (7) $$
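Equation (7) has the form of a normalized inner product between a sequence and its one-step shift. The sketch below computes it under the assumption that X is a per-block list of n-gram frequencies; the function name `adjacent_similarity` and the choice to return 0.0 when the denominator vanishes are illustrative assumptions, not taken from the paper.

```python
import math
from typing import Sequence


def adjacent_similarity(x: Sequence[float]) -> float:
    """Normalized inner product of X_j and X_{j+1}, after Equation (7)."""
    head, tail = x[:-1], x[1:]
    numerator = sum(a * b for a, b in zip(head, tail))
    denominator = math.sqrt(sum(a * a for a in head) * sum(b * b for b in tail))
    # Assumption: define the feature as 0.0 when either sum of squares is zero.
    return numerator / denominator if denominator else 0.0


flat = adjacent_similarity([1.0, 1.0, 1.0])         # identical neighbours
alternating = adjacent_similarity([1.0, 0.0, 1.0])  # neighbours never overlap
print(flat, alternating)
```

For non-negative frequencies the value lies between 0 and 1, with values near 1 indicating that neighbouring entries are nearly proportional.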
3.2.10 Kullback-Leibler Divergence
KL Divergence is a measure of the difference between two probability distributions [31]. It is often referred to as a distance measure between two distributions. Mathematically, it is represented as:

$$ KL_i(X_j \,\|\, X_{j+1}) = \sum_{j=0}^{n} X_j \log \frac{X_j}{X_{j+1}} \qquad (10) $$

Table 1: Statistics of benign files used in this study

Filetype   Qty.   Avg. Size      Min. Size      Max. Size
                  (kilo-bytes)   (kilo-bytes)   (kilo-bytes)
DOC        300    1,015.2        44             7,706
EXE        300    4,095.0        48             15,005
JPG        300    1,097.8        3              1,629
MP3        300    3,384.4        654            6,210
PDF        300    1,513.1        25             20,188
ZIP        300    1,489.6        8              9,860
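The KL divergence feature of Equation (10) can be sketched as below. The handling of zero frequencies is an assumption made for this example, since the excerpt does not specify it: terms with $X_j = 0$ are skipped (the usual $0 \log 0 = 0$ convention) and a zero $X_{j+1}$ is floored at a small hypothetical constant `EPSILON`. The natural logarithm is also an assumption, because Equation (10) does not give the base.

```python
import math
from typing import Sequence

EPSILON = 1e-12  # hypothetical floor to avoid division by zero


def kl_adjacent(x: Sequence[float], eps: float = EPSILON) -> float:
    """Sum of X_j * log(X_j / X_{j+1}) over adjacent entries, after Equation (10)."""
    total = 0.0
    for current, following in zip(x, x[1:]):
        if current > 0:  # 0 * log(0 / q) is taken as 0
            total += current * math.log(current / max(following, eps))
    return total


uniform = kl_adjacent([0.25, 0.25, 0.25, 0.25])
skewed = kl_adjacent([0.5, 0.25])
print(uniform, skewed)
```

A flat frequency profile yields zero, and the value grows as neighbouring frequencies diverge.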
F3             1             0.866
F4             2             0.891
F1-F2          1, 1          0.940
F1-F3          1, 1          0.928
F1-F4          1, 2          0.932
F2-F4          1, 2          0.929
F1-F2-F4       1, 1, 2       0.962
F3-F2          1, 1          0.954
F3-F4          1, 2          0.913
F1-F2-F3-F4    1, 1, 1, 2    0.956

[ROC plot: tp rate versus fp rate for Boosted IBK, IBK, Boosted NB, NB, Boosted J48 and J48]
(a) between Backdoor and Benign files

| | CorelationCoefficient1 > 0.619523
| | | Chebyshev4 <=1.405
| | | | Itakura2 <= 87.231983: Malicious (352.0/9.0)
| | | | Itakura2 > 87.231983
| | | | | TotalVariation1 <= 0.3415: Malicious (11.0)
| | | | | TotalVariation1 > 0.3415: Benign (8.0)
(b) between Trojan and Benign files

| | CorelationCoefficient4 > 0.187794
| | | Simpson_Index_1 <= 0.005703
| | | | CorelationCoefficient3 <= 0.17584: Malicious (32.0)
| | | | CorelationCoefficient3 > 0.17584
| | | | | Entropy1 <= 4.969689: Malicious (4.0/1.0)
| | | | | Entropy1 > 4.969689: Benign (5.0)
| | | Simpson_Index_1 > 0.005703: Benign (11.0)
(c) between Virus and Benign files

| | | Entropy2 > 3.00231
| | | | Canberra2 <= 49.348481
| | | | | Canberra1 <= 14.567909: Malicious (161.0)
| | | | | Canberra1 > 14.567909
| | | | | | Itakura3 <= 126.178699: Malicious (5.0)
| | | | | | Itakura3 > 126.178699: Benign (5.0)
| | | | Canberra2 > 49.348481: Benign (13.0/1.0)
(d) between Worm and Benign files

CorelationCoefficient3 <= -0.013832
| Itakura2 <= 5.905754
| | Itakura3 <= 5.208592
| | | CorelationCoefficient4 <= -0.155078: Malicious (7.0)
| | | CorelationCoefficient4 > -0.155078: Benign (43.0/6.0)
| | Itakura3 > 5.208592: Benign (303.0/5.0)
(e) between Constructor and Benign files

| Entropy4 > 6.754364
| | KL1 <= 0.698772
| | | Manhattan3 <= 0.001003
| | | | Entropy2 <= 6.411063: Malicious (29.0)
| | | | Entropy2 > 6.411063
| | | | | KL1 <= 0.333918: Malicious (2.0)
| | | | | KL1 > 0.333918: Benign (2.0)
| | | Manhattan3 > 0.001003: Benign (3.0)
(f) between Miscellaneous and Benign files

The novelty of our scheme lies in the way the selected features have been computed (per-block n-gram analysis) and in the correlation between the blocks classified as benign or potentially malicious. The features used in our study are taken from statistics and information theory. Many of these features have already been used by researchers in other fields for similar classification problems. The chosen set of features is not, by any means, the optimal collection. The selection of an optimal number of features remains an interesting problem which we plan to explore in our future work. Moreover, the executable dataset used in our study contained both packed and non-packed PE files. We plan to evaluate the robustness of our proposed technique on a manually crafted packed file dataset.

Figure 3: ROC plot for detecting malware from benign files (tp rate versus fp rate). Legend: Virus (AUC = 0.945), Worm (AUC = 0.919), Trojan (AUC = 0.881), Backdoor (AUC = 0.849), Constructor (AUC = 0.925), Miscellaneous (AUC = 0.903).

7. CONCLUSION & FUTURE WORK
In this paper we have proposed a non-signature based technique which analyzes the byte-level file content. We argue that such a technique provides implicit robustness against common obfuscation techniques, especially the repacking of malware to obfuscate signatures. An outcome of our research is that malicious and benign files are inherently different even at the byte-level.

The proposed scheme uses a rich feature set of 13 different statistical and information-theoretic features computed on 1-, 2-, 3- and 4-grams of each block of a file. Once we have calculated our feature set, we give it as an input to the boosted decision tree (J48) classifier. The choice of feature set and classifier is an outcome of extensive pilot studies done to explore the design space. The pilot studies demonstrate the benefit of our approach compared with other well-known data mining techniques: the strings and KM approaches. We have tested our solution on an extensive executable dataset. The results of our experiments show that our technique achieves 90% detection accuracy for different malware types. Another important feature of our framework is that it can also classify the family of a given malware file, i.e. virus, trojan etc.

In future, we would like to evaluate our scheme on a larger dataset of benign and malicious executables and reverse engineer the feature set for further improving the detection accuracy. Moreover, we plan to evaluate the robustness of our proposed technique on a customized dataset containing manually packed executable files.

Acknowledgments
This work is supported by the National ICT R&D Fund, Ministry of Information Technology, Government of Pakistan. The information, data, comments, and views detailed herein may not necessarily reflect the endorsements of views of the National ICT R&D Fund.

We acknowledge M.A. Maloof and J.Z. Kolter for their valuable feedback regarding the implementation of strings and KM approaches. Their comments were of great help in establishing the experimental testbed used in our study. We also acknowledge the anonymous reviewers for their valuable suggestions pertaining to possible extensions of our study.
8. REFERENCES
[1] Symantec Internet Security Threat Reports I-XI (Jan 2002-Jan 2008).
[2] F-Secure Corporation, “F-Secure Reports Amount of Malware Grew by 100% during 2007”, Press release, 2007.
[3] A. Stepan, “Improving Proactive Detection of Packed Malware”, Virus Bulletin, March 2006, available at http://www.virusbtn.com/virusbulletin/archive/2006/03/vb200603-packed.dkb
[4] R. Perdisci, A. Lanzi, W. Lee, “Classification of Packed Executables for Accurate Computer Virus Detection”, Pattern Recognition Letters, 29(14), pp. 1941-1946, Elsevier, 2008.
[5] AVG Free Antivirus, available at http://free.avg.com/.
[6] Panda Antivirus, available at http://www.pandasecurity.com/.
[7] M.G. Schultz, E. Eskin, E. Zadok, S.J. Stolfo, “Data mining methods for detection of new malicious executables”, IEEE Symposium on Security and Privacy, pp. 38-49, USA, IEEE Press, 2001.
[8] J.Z. Kolter, M.A. Maloof, “Learning to detect malicious executables in the wild”, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 470-478, USA, 2004.
[9] J. Kephart, G. Sorkin, W. Arnold, D. Chess, G. Tesauro, S. White, “Biologically inspired defenses against computer viruses”, International Joint Conference on Artificial Intelligence (IJCAI), pp. 985-996, USA, 1995.
[10] R.W. Lo, K.N. Levitt, R.A. Olsson, “MCF: A malicious code filter”, Computers & Security, 14(6):541-566, Elsevier, 1995.
[11] O. Henchiri, N. Japkowicz, “A Feature Selection and Evaluation Scheme for Computer Virus Detection”, IEEE International Conference on Data Mining (ICDM), pp. 891-895, USA, IEEE Press, 2006.
[12] P. Kierski, M. Okoniewski, P. Gawrysiak, “Automatic Classification of Executable Code for Computer Virus Detection”, International Conference on Intelligent Information Systems, pp. 277-284, Springer, Poland, 2003.
[13] T. Abou-Assaleh, N. Cercone, V. Keselj, R. Sweidan, “Detection of New Malicious Code Using N-grams Signatures”, International Conference on Intelligent Information Systems, pp. 193-196, Springer, Poland, 2003.
[14] J.H. Wang, P.S. Deng, “Virus Detection using Data Mining Techniques”, IEEE International Carnahan Conference on Security Technology, pp. 71-76, IEEE Press, 2003.
[15] W.J. Li, K. Wang, S.J. Stolfo, B. Herzog, “Fileprints: identifying filetypes by n-gram analysis”, IEEE Information Assurance Workshop, USA, IEEE Press, 2005.
[16] S.J. Stolfo, K. Wang, W.J. Li, “Towards Stealthy Malware Detection”, Advances in Information Security, Vol. 27, pp. 231-249, Springer, USA, 2007.
[17] W.J. Li, S.J. Stolfo, A. Stavrou, E. Androulaki, A.D. Keromytis, “A Study of Malcode-Bearing Documents”, International Conference on Detection of Intrusions & Malware, and Vulnerability Assessment (DIMVA), pp. 231-250, Springer, Switzerland, 2007.
[18] M.Z. Shafiq, S.A. Khayam, M. Farooq, “Embedded Malware Detection using Markov n-Grams”, International Conference on Detection of Intrusions & Malware, and Vulnerability Assessment (DIMVA), pp. 88-107, Springer, France, 2008.
[19] M. Christodorescu, S. Jha, and C. Kruegel, “Mining Specifications of Malicious Behavior”, European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2007), pp. 5-14, Croatia, 2007.
[20] Frans Veldman, “Heuristic Anti-Virus Technology”, International Virus Bulletin Conference, pp. 67-76, USA, 1993, available at http://mirror.sweon.net/madchat/vxdevl/vdat/epheurs1.htm.
[21] Jay Munro, “Antivirus Research and Detection Techniques”, Antivirus Research and Detection Techniques, ExtremeTech, 2002, available at http://www.extremetech.com/article2/0,2845,367051,00.asp.
[22] D.W. Aha, D. Kibler, M.K. Albert, “Instance-based learning algorithms”, Journal of Machine Learning, Vol. 6, pp. 37-66, 1991.
[23] M.E. Maron, J.L. Kuhns, “On relevance, probabilistic indexing and information retrieval”, Journal of the Association of Computing Machinery, 7(3), pp. 216-244, 1960.
[24] Y. Freund, R.E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting”, Journal of Computer and System Sciences, No. 55, pp. 23-37, 1997.
[25] J.R. Quinlan, “C4.5: Programs for machine learning”, Morgan Kaufmann, USA, 1993.
[26] I.H. Witten, E. Frank, “Data mining: Practical machine learning tools and techniques”, Morgan Kaufmann, 2nd edition, USA, 2005.
[27] VX Heavens Virus Collection, VX Heavens website, available at http://vx.netlux.org
[28] J. Oberheide, E. Cooke, F. Jahanian, “CloudAV: N-Version Antivirus in the Network Cloud”, USENIX Security Symposium, pp. 91-106, USA, 2008.
[29] T. Fawcett, “ROC Graphs: Notes and Practical Considerations for Researchers”, TR HPL-2003-4, HP Labs, USA, 2004.
[30] S.D. Walter, “The partial area under the summary ROC curve”, Statistics in Medicine, 24(13), pp. 2025-2040, 2005.
[31] T.M. Cover, J.A. Thomas, “Elements of Information Theory”, Wiley-Interscience, 1991.