Efficient and Effective Malware Detection System
Vinyas Raju
Department of Computer Science
The University of Texas at Dallas
Dallas, Texas, USA
Abstract
Introduction
Hashing is a computation that maps data of arbitrary size to data of a fixed size. Hashing
algorithms are widely used in security applications. Locality Sensitive Hashing (LSH) is one of
the main categories of hashing methods. It hashes input data so that similar items map to the
same “buckets” with high probability, that is, it maximizes the probability of a “collision” for
similar inputs. Simhashing is one of the most widely used LSH algorithms for finding similar
strings, and it is designed to approximate the cosine similarity between inputs. The main
concept of simhashing comes from Sign Random Projections (SRP). Given an input vector x, SRP
draws a random vector w whose components are generated independently from a Gaussian
distribution and keeps only the sign of the projection of x onto w. Simhashing has wide-ranging
applications, from detecting duplicate texts to various security tasks and malware analysis,
typically with the Hamming distance as the similarity measure. Inspired by the NLP application
domain, an n-gram is a contiguous sequence of n items (here, a byte pair) from a given
sequence of the binary file. The n-gram feature representation
is a specific type of bag-of-words representation in which only the number of occurrences of
the items is decisive and the location of the items in the binary file is neglected. The theory
behind simhashing allows us to weight the byte n-gram with the number of occurrences rather
than only representing the presence of the byte n-gram in the file. The proposed feature
representation generates a fixed-size vector from a binary file of arbitrary size. Given a binary
file, each n-gram is first hashed to a fixed-size vector. To speed up this process, a dictionary of
n-grams is built first, and this vocabulary is then hashed to binary values. With the term
frequency (tf) of each vocabulary entry stored, each hash bit, taking the value 1 or -1, is
weighted by the tf of its n-gram; thus, tf is inserted into the representation. In the next step,
all the vectors are summed bit-wise, yielding a final fixed-size vector. With this process, we
embed the distribution of the byte n-grams into the vector. The resulting vectors are close to
each other when two files share many n-grams and far apart when the files have many
different n-grams.
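The steps described above can be illustrated with a minimal sketch. It is not the implementation
used in this work: the function names are hypothetical, and the choice of BLAKE2b as the per-n-gram
hash, the 64-bit vector width, and n = 2 are assumptions made only for illustration.

```python
import hashlib
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count (tf) every contiguous byte n-gram in the file content."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def ngram_signs(gram: bytes, bits: int = 64) -> list:
    """Hash one n-gram to `bits` values in {+1, -1}.

    Assumption: BLAKE2b stands in for the hash; `bits` must be a
    multiple of 8 and at most 512.
    """
    digest = hashlib.blake2b(gram, digest_size=bits // 8).digest()
    value = int.from_bytes(digest, "big")
    return [1 if (value >> k) & 1 else -1 for k in range(bits)]


def simhash_vector(data: bytes, n: int = 2, bits: int = 64) -> list:
    """Sum the tf-weighted +1/-1 hash vectors of all n-grams, bit by bit."""
    acc = [0] * bits
    for gram, tf in byte_ngrams(data, n).items():
        for k, sign in enumerate(ngram_signs(gram, bits)):
            acc[k] += tf * sign
    return acc


def fingerprint(vec: list) -> int:
    """Keep only the sign of each component, packed into an integer."""
    return sum(1 << k for k, v in enumerate(vec) if v > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two files can then be compared by calling `hamming(fingerprint(simhash_vector(a)),
fingerprint(simhash_vector(b)))`; identical inputs give a distance of 0, and files that share most
of their n-grams are expected to give a small distance.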
THE PROPOSED SCHEME
Because the latent representation produced at the output of the first hidden layer is based on
similarity in the original space, the second hidden layer of the proposed model can provide
a similarity measure. Indeed, we need a task-specific similarity over pairs of data points to
incorporate the prior knowledge (i.e., the training samples in the first hidden layer). This
similarity measure, followed by a linear predictor, also yields a convex optimization problem.
Kernel methods can play this role. The relation between kernel machines and neural networks
has been widely investigated. Because the kernel layer depends on the data but not on the
labels, its training can be seen as unsupervised. Figure 2 presents the proposed scheme. The
output layer weights are analytically obtained using the linear least squares technique. The
output layer is an Extreme Learning Machine (ELM). The whole scheme has more than one hidden
layer; thus, it is a deep neural network. However, because training does not use the
back-propagation algorithm, the scheme differs from what the machine learning community
usually calls deep learning. The kernel layer is a non-parametric, nonlinear model that
matches the input against templates obtained from the training samples. The Radial Basis
Function (RBF) kernel is well known for providing an infinite-dimensional kernel space and is
commonly used with the kernel trick. As we show later, our model supports the kernel trick, so
the RBF kernel is a logical choice.
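As a rough illustration of the kernel layer followed by an analytically solved output layer, the
sketch below uses the training samples as RBF templates and fits the output weights by
regularized linear least squares. It is a simplified stand-in under stated assumptions, not the
scheme of Figure 2: the function names, the kernel width gamma, and the small ridge term are all
hypothetical, and plain Python lists are used only to keep the example dependency-free (a
numerical library would normally be preferred).

```python
import math


def kernel_layer(X, templates, gamma=1.0):
    """RBF similarity of every input row to every template row."""
    return [
        [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, t)))
         for t in templates]
        for x in X
    ]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def fit_output_weights(K, y, ridge=1e-6):
    """Least squares via the normal equations (K^T K + ridge*I) beta = K^T y.

    Assumption: the ridge term is added only for numerical stability.
    """
    rows, cols = len(K), len(K[0])
    A = [
        [sum(K[i][a] * K[i][b] for i in range(rows)) + (ridge if a == b else 0.0)
         for b in range(cols)]
        for a in range(cols)
    ]
    rhs = [sum(K[i][a] * y[i] for i in range(rows)) for a in range(cols)]
    return solve(A, rhs)


def predict(X, templates, beta, gamma=1.0):
    """Linear output layer applied to the kernel-layer activations."""
    K = kernel_layer(X, templates, gamma)
    return [sum(k * w for k, w in zip(row, beta)) for row in K]
```

No gradient-based training appears anywhere in this sketch: the kernel layer is fixed by the
unlabeled training samples, and the only fitted parameters are the output weights, obtained in
closed form.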
Our main goal was to build a machine learning framework that generically detects as many
malware samples as possible, under the tough constraint of a zero false positive rate. We came
very close to this goal, although the false positive rate is still non-zero. For this framework
to become part of a highly competitive commercial product, a number of deterministic exception
mechanisms have to be added. In our opinion, malware detection via machine learning will not
replace the standard detection methods used by anti-virus vendors, but will complement them.
Any commercial anti-virus product is subject to certain speed and memory limitations;
therefore, the most reliable algorithms among those presented here are the cascade one-sided
perceptron and its explicitly mapped variant. Since most anti-virus products already achieve a
detection rate of over 90%, an increase in the total detection rate of 3% to 4%, such as the one
produced by our algorithms, is very significant. At present, our framework has proven to be a
valuable research tool for the computer security experts at BitDefender AntiMalware Research
Labs. In the near future we plan to integrate more classification algorithms into it, for
instance large margin perceptrons and Support Vector Machines.