
Efficient and Effective Malware Detection System

Vinyas Raju
Department of Computer Science
The University of Texas at Dallas
Dallas, Texas, USA
Abstract

Malware analysis is an important problem in cyber-security. Besides good precision and
recognition rate, a malware detection scheme should ideally generalize well to
novel malware families. It is also important that the system does not require excessive computation,
particularly for deployment on mobile devices. We propose a versatile framework in which
one can employ different machine learning algorithms to distinguish between
malware files and clean files, while aiming to minimize the number of false positives.

Introduction

Malware is a program that disrupts computer operation, gathers sensitive information, or
gains access to private systems without the user's consent. With the ever-increasing use of mobile
devices, mobile malware poses a significant threat because these devices store contacts, bank
account numbers, credit/debit card numbers, private photos, messages, and a lot of other sensitive
information that can be leaked. According to a recent report published jointly by Kaspersky
Lab and INTERPOL, 20% of devices that use their software were attacked at least once by
malware. Their study Financial Cyberthreats in 2014 reports that the number of financial
malware attacks against Android users grew 3.25-fold in 2014. Given the tremendous
growth of Android malware, there is a pressing need for effective malware detection methods.
A few static and dynamic malware detection methods have been reported in the literature.
While they convey valuable insight into malware research, the detection methods are not
precise enough. Various surveys define or categorize detection procedures.
Some of this research proposes permission analysis, which is obsolete as permissions cannot be faked
in Android. The latter work suggests identifying specific “malicious” system call events which do
not occur except in malicious applications. But such combinations of system calls may also occur
in legitimate applications, leading to false positive detection of malware. Investigations of
Malicious Repackaged Applications (MRAs), such as work on system call patterns in repackaged
applications [8], propose more fine-tuned approaches. The main objective of this work is to propose a
precise and quantitative approach that detects the presence of malicious behavior in an arbitrary
Android application by testing that app’s behavior against our findings on 1260 malware samples and
227 “good” applications, and classifies that app as “good” or “bad” (malware). The
contributions of the work are: 1) performing network traffic analysis by capturing the URLs of all
remote locations that are contacted by applications for a specific period of time, then matching
against a database of known malicious domains; 2) conducting system call analysis by logging all
system calls made by applications from our dataset, then filtering and grouping the results by
their frequencies in both malicious and non-malicious applications; 3) validating these findings
on test applications to sort them into malware and non-malware categories, referred to hereafter as
“bad” and “good” apps, respectively. Our network analysis is a significant improvement on
application behavior confirmation methods. The results show successful capture and
identification of malware connecting to blacklisted domains. The illustrated method can be
implemented quite easily in future attempts to find such events. Our system-call-based
approach outlines a threshold for the acceptability of an Android app (“good” or “bad”, as
mentioned).
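
The two analyses described above can be sketched roughly as follows. This is only an
illustration under stated assumptions, not the procedure actually used in this work: the
blacklist entries, the function names, the set of "suspicious" system calls, and the threshold
value are all hypothetical, and the text does not specify how the threshold is derived.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical blacklist; a real system would load a maintained list of malicious domains.
BLACKLISTED_DOMAINS = {"malicious.example", "c2.example"}


def contacted_blacklisted(urls, blacklist=BLACKLISTED_DOMAINS):
    """Return captured URLs whose host is a blacklisted domain or a subdomain of one."""
    hits = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        if any(host == d or host.endswith("." + d) for d in blacklist):
            hits.append(url)
    return hits


def syscall_frequencies(log):
    """Relative frequency of each system call name in one application's log."""
    counts = Counter(log)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()} if total else {}


def classify(urls, syscall_log, suspicious_calls, threshold=0.5):
    """Label an app "bad" if it contacts a blacklisted domain or if the combined
    frequency of the (assumed) suspicious calls exceeds the (assumed) threshold."""
    if contacted_blacklisted(urls, BLACKLISTED_DOMAINS):
        return "bad"
    freqs = syscall_frequencies(syscall_log)
    score = sum(freqs.get(call, 0.0) for call in suspicious_calls)
    return "bad" if score > threshold else "good"
```

In this sketch a single blacklist hit is treated as decisive, while the system call score only
matters when no blacklisted domain was contacted. Other ways of combining the two signals are
equally plausible.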
Malware detection is of paramount importance to our digital era and thus to daily life. A vast
number of metamorphic variants of malware for Windows and Android platforms have been
developed. In addition to the volume of malware generated, novel families make the detection
task overwhelming. Malware detection is mostly based on static and/or dynamic analysis of
samples. Static analysis uses a binary file and/or disassembled code without running it. It is
quite efficient in most cases, but has problems with heavy obfuscation. Dynamic analysis is a
better solution for obfuscated samples because it relies on run-time behavior, but it is
computationally expensive, and the analysis might not see malicious behavior during testing.
Given the extracted features, a classic method to detect malicious code is to generate a signature
for every malware sample. Signature-based methods are only good for detecting known
malware.
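
The limitation of signature-based detection can be illustrated with a minimal sketch. It
assumes, purely for illustration, that a signature is the SHA-256 digest of the whole file;
commercial signatures are richer than this, and the function names below are hypothetical.

```python
import hashlib


def signature(data: bytes) -> str:
    """Assumed signature: the SHA-256 digest of the complete file contents."""
    return hashlib.sha256(data).hexdigest()


def build_signature_db(samples):
    """Collect one signature per known malware sample."""
    return {signature(sample) for sample in samples}


def is_known_malware(data: bytes, signatures) -> bool:
    """A file is flagged only if its signature is already in the database."""
    return signature(data) in signatures
```

Under this assumption, a variant that differs from a known sample by a single byte produces a
different digest and is not flagged, which is why such methods only detect known malware.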
The model is an integrated system in which static features are extracted from a binary file and
classified by a neural network. Although this neural network could be a deep learning model, it is
computationally very expensive to use back-propagation to learn over a very large feature space. A
common solution in this situation is to use random projection techniques. The projected
feature space is then fed to the deep neural network. Random projection with our training
algorithm shows quite strong results, supported by a theoretical justification. Figure 1 presents a
high-level concept of Malytics. Inspired by Natural Language Processing (NLP), the term
frequency (tf) of the given binary file is multiplied by a random projection matrix with entries 1
and -1. The result is called tf-simhashing. This process is linear. The representation is fed to the
next stage/layer, where the similarity indices are obtained as the input for classification. To
improve classification, generic non-linear features can be used, but this can cause poor
generalization. One motivation for using the Extreme Learning Machine (ELM) as the
supervised classifier in this paper is to address generalization. We collected different datasets for our
experiments. Because the samples were collected in the wild, they could be a malware file or
malicious code that was imaged into another file. This setting helps test the model for real
world application. We cannot directly compare our model with other work because we do
not have access to the datasets of specific state-of-the-art work, except one Android dataset; however,
we think that in many cases similar datasets have been used. Our ground truth for malware
samples comes from a collection of 19 well-known AV vendors. To evaluate Malytics, a wide range of
experiments on Android and Windows samples is conducted. For Android, we propose to use
the Dex file rather than the raw APK. Our experiments show that tf-simhashing of the Dex file of an
APK carries more efficient information than the APK itself. Dex files are also smaller than
APKs. The results on Windows Portable Executable (PE) files show that the model is not
dependent on a particular operating system.

THE FEATURE EXTRACTION

Hashing is a computation which maps arbitrary size data into data of a fixed size. Hashing
algorithms have been widely used in the security application domain. Locality Sensitive Hashing
(LSH) is one of the main categories of hashing methods. It hashes input data so that similar data
maps to the same “buckets” with high probability, maximizing the probability of a “collision” for
similar inputs. Simhashing is one of the most widely used LSH algorithms, adopted to find
similar strings. Simhashing is an LSH that is designed to approximate the cosine similarity
between inputs. The main concept of simhashing comes from Sign Random Projections (SRP).
Given an input vector, SRP utilizes a random Gaussian unit vector with each component
generated independently. Simhashing has wide-ranging applications, from detecting duplicates in
texts to various security tasks and malware analysis, specifically with the Hamming distance
similarity measure. Borrowed from the NLP application domain, an n-gram is a contiguous sequence of n items
(here, a byte pair) from a given sequence of the binary file. The n-gram feature representation
is a specific type of bag-of-words representation in which only the number of occurrences of
the items is decisive and the location of the items in the binary file is neglected. The theory
behind simhashing allows us to weight the byte n-gram with the number of occurrences rather
than only representing the presence of the byte n-gram in the file. The proposed feature
representation generates a fixed-size vector from an arbitrary-size binary file. Given a binary file,
each n-gram is first hashed to a single fixed-size vector. To speed up this process, a dictionary
of n-grams is built first, and this vocabulary is then hashed to binary values. With the tf of the
vocabulary stored, each hash bit with value 1 or -1 is weighted by the tf of the n-gram. Thus, tf is
inserted into the representation. In the next step, all the vectors are summed bit-wise, thereby
providing a final fixed-size vector. With this process, we embed the distribution of the byte
n-grams into the vector. This representation yields two vectors that are close to each other
when two files have many common n-grams and far apart when the files have many different
n-grams.
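
The steps above can be sketched in a few lines of Python with NumPy. This is a minimal
illustration rather than the Malytics implementation: the vector size of 256, the use of
byte 2-grams, and the way each n-gram is mapped to a +1/-1 vector (a SHA-256 digest seeding a
random generator) are assumptions chosen only to keep the example self-contained.

```python
import hashlib
from collections import Counter

import numpy as np


def byte_ngram_tf(data: bytes, n: int = 2) -> Counter:
    """Term frequency of every contiguous byte n-gram in the file."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def ngram_signs(ngram: bytes, dim: int) -> np.ndarray:
    """Deterministic pseudo-random vector of +1/-1 values for one n-gram
    (assumed scheme: the n-gram's SHA-256 digest seeds the generator)."""
    seed = int.from_bytes(hashlib.sha256(ngram).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    return rng.integers(0, 2, size=dim) * 2 - 1


def tf_simhash(data: bytes, dim: int = 256, n: int = 2) -> np.ndarray:
    """Sum the +1/-1 vectors of all n-grams, each weighted by its term frequency."""
    vec = np.zeros(dim)
    for ngram, tf in byte_ngram_tf(data, n).items():
        vec += tf * ngram_signs(ngram, dim)
    return vec


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Whatever the input length, `tf_simhash` returns a vector of `dim` entries, and two inputs that
share most of their n-grams end up with a cosine similarity close to 1. An input shorter than
`n` bytes yields the zero vector, which `cosine` does not handle.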

THE PROPOSED SCHEME

Because the latent representation generated in the output of the first hidden layer is based on
the similarity of the original space, the second hidden layer of the proposed model can provide
a similarity measure. Indeed, we need a task-specific similarity over pairs of data points to
facilitate the prior knowledge (i.e. training samples in the first hidden layer). This similarity
measure followed by a linear predictor also yields a convex optimization problem. Kernel
methods can play this role. The relation between kernel machines and the neural network has
been widely investigated. Because the kernel layer depends on the data but not on the labels, its
training can be seen as unsupervised. Figure 2 presents the proposed scheme. The
output layer weights are obtained analytically using the linear least squares technique. The
output layer is the ELM. The whole scheme has more than one hidden layer; thus, it is a deep
neural network. However, because training does not use the back-propagation algorithm,
the scheme differs from deep learning as that term is commonly understood in the
machine learning community. The kernel layer is a non-parametric and nonlinear model to
match the input to the templates that are obtained from the training samples. The Radial Basis
Function (RBF) kernel is well known for providing an infinite-dimensional kernel space and is
commonly used with the kernel trick. As we show later, our model supports the kernel trick, so
the RBF kernel is a logical choice.
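
A rough sketch of the kernel layer followed by an analytically solved output layer is given
below. It is an illustration under assumptions, not the exact training algorithm of the
scheme: the templates are taken to be the training samples themselves, the value of `gamma` is
arbitrary, and a small ridge term `reg` is added to the least squares problem for numerical
stability, which the text does not mention.

```python
import numpy as np


def rbf_kernel(X: np.ndarray, templates: np.ndarray, gamma: float) -> np.ndarray:
    """RBF similarity between every input row and every template row."""
    sq_dist = ((X[:, None, :] - templates[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dist)


def fit_output_weights(K: np.ndarray, y: np.ndarray, reg: float = 1e-3) -> np.ndarray:
    """Solve the regularized least squares problem (K^T K + reg*I) beta = K^T y."""
    A = K.T @ K + reg * np.eye(K.shape[1])
    return np.linalg.solve(A, K.T @ y)


def predict(X: np.ndarray, templates: np.ndarray, beta: np.ndarray, gamma: float) -> np.ndarray:
    """Sign of the linear output layer applied to the kernel layer activations."""
    return np.sign(rbf_kernel(X, templates, gamma) @ beta)
```

Training in this sketch is a single linear solve, with labels encoded as +1 and -1:
`beta = fit_output_weights(rbf_kernel(X_train, X_train, gamma), y_train)`. No back-propagation
is involved, which matches the distinction drawn above.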

CONCLUSION AND FUTURE WORK

Our main target was to come up with a machine learning framework that generically detects as
many malware samples as it can, with the tough constraint of having a zero false positive rate.
We were very close to our goal, although we still have a non-zero false positive rate. For
this framework to become part of a highly competitive commercial product, a number of
deterministic exception mechanisms have to be added. In our opinion, malware detection via
machine learning will not replace the standard detection methods used by anti-virus vendors,
but will come as an addition to them. Any commercial anti-virus product is subject to certain
speed and memory limitations; therefore, the most reliable algorithms among those presented
here are the cascade one-sided perceptron and its explicitly mapped variant. Since most
anti-virus products manage to have a detection rate of over 90%, it follows that an increase in
the total detection rate of 3% to 4%, such as the one produced by our algorithms, is very significant.
So far, our framework has proven to be a valuable research tool for the computer
security experts at BitDefender AntiMalware Research Labs. In the near future we plan to
integrate more classification algorithms into it, for instance large margin perceptrons and Support
Vector Machines.

REFERENCES

[1] Y. Zhou and X. Jiang, “Dissecting android malware: Characterization and evolution,” in IEEE
Symposium on Security and Privacy, San Francisco, CA, May 20-23, 2012, pp. 95–109.
[2] Kaspersky Lab and INTERPOL Survey Reports, “Mobile cyber threats,”
http://media.kaspersky.com/pdf/Kaspersky-Lab-KSN-Report-mobilecyberthreats-web.pdf
[3] “The Number of Financial Attacks Against Android Users Tripled in 2014,”
http://www.kaspersky.com/about/news/virus/2015/The-Numberof-Financial-Attacks-Against-Android-Users-Tripled-in-2014
[4] S. Ramu, “Mobile malware evolution, detection and defense,” in EECE 571B, Term Survey
Paper, April 2012.
[5] M. Chandramohan and H. B. K. Tan, “Detection of mobile malware in the wild,” IEEE Computer,
vol. 45, no. 9, pp. 65–71, Sep 2012.
[6] T. S. D. Iland, A. Pucher, “Detecting android malware on network level,” in Technical representation
UC Santa Barbara, 2012.
[7] Y. J. Ham and H. Lee, “Detection of malicious android mobile applications based on aggregated
system call events,” International Journal of Computer and Communication Engineering, vol. 3,
no. 2, pp. 340–350, March, 2014.
[8] C. C. H. T. Y. Lin, Y. Lai, “Identifying android malicious repackaged applications by thread-grained
system call sequences,” Computers and Security, vol. 39, pp. 340–350.
[9] T. Isohara, K. Takemori, and A. Kubota, “Kernel-based behavior analysis for android malware
detection,” in International Conference on Computational Intelligence and Security.
[10] M. Zaman, T. Siddiqui, M. Amin, and M. Hossain, “Malware detection in android by network traffic
analysis,” in Networking Systems and Security (NSysS), 2015 International Conference on, Jan
2015, pp. 1–5.
[11] Netstat command, http://en.wikipedia.org/wiki/Netstat.
