Detecting Malware Using Process Tree and Process Activity Data

This document proposes a method to detect malware using process tree and activity data. It assumes processes from the same application will have similar activity patterns, while processes from different applications will have more dissimilar patterns. The method involves collecting process tree and activity data from clean and malware-infected systems. It then uses distance measures to compare processes and detect those that deviate from known clean processes, indicating potential malware. An evaluation showed the algorithm could detect processes from two of three malware samples, with a true positive rate as high as 0.917. Future work could involve collecting all data on a single machine for better results.

Detecting malware using process tree and process activity data


Krijn Wijnands
Faculty of Technology, Policy and Management
Delft University of Technology,
the Netherlands
Email: k.j.wijnands@student.tudelft.nl

Abstract: In the last few years malware has been incurring more damage and has become more sophisticated. Current security solutions are still based on signature and known-behavior detection, which renders them incapable of detecting new malware. In this paper we present an anomaly detection method based on the combined data from process activities and process trees. We assume that processes from the same application show comparable process activities and that different applications show differences in their process activities. Using a distance measure on the process characteristics, depth and cluster of process a and process b, we can show which processes deviate from the known processes. The detection algorithm tries to minimize the distance for every process in the datasets. Evaluation showed that the presented algorithm could detect processes from two of the three malware samples used. The highest TPR obtained was 0.917. For future research we recommend a data collection set-up in which all data is collected on one machine.

Index Terms: malware detection, process tree, process activities, anomaly detection.

I. INTRODUCTION

Malware is a huge problem in today's IT environment, and the predictions are that it will incur more damage and become more sophisticated [1], [2]. Headlines such as "Enterprise bank accounts targeted in new malware attack" [3], "Hackers attack the energy industry with malware designed for snooping" [4] and "Hackers exploit Flash in one of the largest malware attacks in recent history" [5] are not uncommon and are all from the first eight months of 2015. According to [6] the number of new malware samples discovered each year is rising significantly, from around 80 million samples in 2013 up to 143 million in 2014.

With these huge numbers of new malware samples released, it is difficult for anti-virus vendors to keep their protection against malware up to date. The reason for this is that most security solutions are still based on a combination of signature-based detection and sandboxing. In signature-based detection, hashes of known malware files are used to detect them on a computer. Sandboxing runs the executable with strict policies on the host, such that the executable thinks it can execute all its commands. The behavior of the executable is then compared to known malicious behavior.

From the above information it can be concluded that current security solutions are still based on detecting known malicious behavior. This creates a head start for the malware developers, and the damage is done before the security vendors can update their list of known malicious behavior.

To solve this problem, a detection method should be used that does not rely on known behavior and signatures of known malware. In current scientific literature a lot is written on detecting malicious behavior on computers or networks. The main distinction in detection is made between misuse detection and anomaly detection. Misuse detection is still based upon known malicious behavior and is not sufficient for detecting 0-day malware or exploits [7]–[9]. In contrast, anomaly detection is more suitable for detecting 0-day malware and exploits. The anomaly detection model is based on known normal behavior and has the ability to detect deviations from this known normal behavior [7]–[10]. However, a disadvantage of anomaly detection is the higher number of false positives it generates [11] in comparison to misuse detection.

In this paper we present a novel anomaly detection method for malware based on combined data from process activities and process trees.

In the next section related work is discussed. After the related work we introduce the assumption on which our presented algorithm is based. In section IV we explain what data is collected. Section V provides an overview of how the data is collected, after which the data processing is explained in section VI. Section VII explains the novel detection algorithm, which is evaluated in section VIII. The paper ends with the conclusion in section IX and recommendations for future research in section X.

II. RELATED WORK

In [12] anomaly detection on Linux is done by using process related information, which includes the relationships among processes. This information is used to create a graph showing the relations between processes and processes, processes and programs, and processes and system calls. Each node in the tree consists of two parameters, nameproc and stadd. To be able to detect malicious behavior the distance between the stadd of two nodes is calculated. Then the model is trained using a supervised SVM on a randomly selected 75% of the dataset and evaluated on the remaining 25%. This was repeated nine times, rendering an accuracy between 0.71 and 0.87.

The concept of process trees for malware detection on Linux machines is also used in [13]. However, instead of system calls the command line options are recorded.

In [14] an anomaly detection method is proposed that uses process properties from Windows systems. The process properties used are: changes to the Windows registry, changes to the filesystem, infection of running processes, network activity, and the starting and stopping of Windows services.

In this paper we extend the work presented above by presenting a concept of anomaly detection for a single Windows host based on the use of process trees and process activity characteristics. The information of the process trees created will be combined with the activity characteristics of
the processes. This will be done for malware-free datasets as well as datasets containing malware infections. Then we will construct three comparison methods to compare the constructed malware datasets against the clean datasets. The outcome of these comparisons will be used for detecting malicious behavior.

III. ASSUMPTION

The main assumption on which the presented method is based is that processes from the same application show comparable process activities and that different applications show differences in their process activities. If we compare the process activities of processes from different applications against each other, this will generate a higher distance than when comparing processes from the same application.

IV. DATASET DESCRIPTION

The data used was collected by an endpoint security application which can log low-level process information on Windows machines. It contains the following eight types of events that can be triggered by a process:
• filesystem
• registry
• process create
• process exit
• thread create
• thread exit
• module load
• object callback

All the event types have the following common data: a unique process id, a unique id assigned by the endpoint security application, and a timestamp. The rest of the data is event specific. For example, the filesystem event contains information on what kind of filesystem action is performed, e.g. a write or read action. The registry event contains information on what registry key action was performed and on which registry key. Of the data collected, about 85 to 90% are filesystem events; registry events take another 8 to 10% of the collected events.

V. DATA COLLECTION

For our research we have collected four clean datasets, each containing a full boot cycle. Two were collected during a normal working day and have a time span of around 7 hours and 50 minutes. The other two clean datasets have a duration of less than an hour. The data was collected on an employee's workstation.

Due to security limitations the collection of the malware data had to be done in a virtual machine. We tried to create an environment that was as identical as possible. For the creation of the malware datasets three different types of malware were used, namely a banking malware (Dridex), a Remote Access Trojan and a variant of the Zeus malware. Each of these malware samples was run in the VM whilst working behavior was simulated on the machine. Again due to security limitations, we were not able to do normal work on the machine because of the risk of leaking personal or company information. Therefore the collected malware datasets are much shorter than the clean datasets, ranging from 20 to 40 minutes. This might influence the outcome of the evaluation.

The malware datasets will be compared against all the clean datasets. More on the evaluation set-up in section VIII.

VI. DATA PREPARATION

The collected data is aggregated such that each row of the dataframe corresponds to a unique process id and contains node and edge information for the process tree. For every process we count how often each event type is triggered and divide this by the total running time of the process in seconds. This provides us with the events triggered per second per process for each event type. This was done to eliminate the effect that processes running for a long time show high event counts. To be able to compare the columns within a dataframe and between dataframes we normalize the data between 0 and 10, see equation 1. Here x is the value to be normalized, A and B are the minimum and maximum value, respectively, of the variable to be normalized, and a and b provide the range for the normalization. For our data a is zero and b is ten.

$$\frac{(x - A)\,(b - a)}{B - A} \tag{1}$$

To get the maximum and minimum possible values of the dataset, all collected data was combined to normalize each column. The reason for normalizing the data is the fact that filesystem and registry events, making up about 95% of the data, occur far more often than a process create or thread create event. By normalizing the data on each process activity column we can easily identify high and low values.

A row of the aggregated dataframe contains the following variables: unique process id, filesystem, registry, process create, thread create, module load, ob (object callback), unique parent process id, process executable path, parent process executable path. The second to seventh variables, filesystem to ob, represent the number of times this event is triggered per second by the corresponding process. For example, table I shows the normalized number of times such an event type is triggered per second by the unique process 9999.

TABLE I: Example of events per second
unique process id   filesystem   registry   process create   thread create   module load   ob
9999                0.008845     0.00092    2.06e-05         0.00669         0             0.00469

The process executable path is a tokenized string of the location of the executable. This information is not used in the proposed detection method; however, it provides valuable information to check whether processes belong to the same executable.

As stated in section III we expect that processes from the same application tend to show the same process activity. If we cluster the processes based on the six activity types, these processes will be in the same cluster. If a process shows deviating process activities it will be assigned to another cluster.
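As a hedged illustration of the preparation steps in this section, the Python sketch below computes the events per second for each process and rescales every activity column with equation 1. The function names, the (process id, event type) input shape and the handling of a column whose minimum equals its maximum are assumptions made for this example; they are not taken from the collected data format.

```python
from collections import Counter

# The six activity columns of the aggregated dataframe.
EVENT_TYPES = ["filesystem", "registry", "process create",
               "thread create", "module load", "object callback"]


def events_per_second(events, running_seconds):
    """Count events per (process, type) and divide by the running time.

    events: iterable of (process_id, event_type) pairs (hypothetical shape).
    running_seconds: dict mapping process_id to its running time in seconds.
    """
    counts = Counter(events)
    return {
        pid: [counts[(pid, ev)] / secs for ev in EVENT_TYPES]
        for pid, secs in running_seconds.items()
    }


def min_max_scale(x, A, B, a=0.0, b=10.0):
    """Equation 1, plus the offset a of the general min-max formula.

    With a = 0, as used in the paper, the offset vanishes.
    """
    if B == A:
        # Assumption: a constant column is mapped to the lower bound.
        return a
    return a + (x - A) * (b - a) / (B - A)


def normalize_columns(rates):
    """Rescale each activity column over all processes passed in."""
    pids = list(rates)
    columns = list(zip(*(rates[p] for p in pids)))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return {
        p: [min_max_scale(v, lows[k], highs[k]) for k, v in enumerate(rates[p])]
        for p in pids
    }
```

To mirror the text, `normalize_columns` would be called once on the rates of all datasets combined, so that every dataset shares the same minimum and maximum per column.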
A k-means clustering algorithm is used for clustering, as this is a widely used clustering algorithm in anomaly and misuse detection [15]. To cluster the data, the K-means "Hartigan-Wong" algorithm is used on the clean datasets by minimizing the within-cluster sum of squares, see equation 2 [16]. Here i = 1, 2, ..., n, with n the number of events, i.e. the number of processes in the dataset; j = 1, 2, ..., p, with p the number of variables, in our case six; and $\bar{x}(k, j)$ is the mean of variable j over all elements in cluster k. The k used is eight, as the within-group sum of squares does not decline much further when selecting a greater k, see figure 1.

[Fig. 1: K-means plot. x-axis: number of clusters (2 to 14); y-axis: within groups sum of squares.]

$$\mathrm{Sum}(k) = \sum_{i=1}^{n} \sum_{j=1}^{p} \left( x(i, j) - \bar{x}(k, j) \right)^2 \tag{2}$$

The cluster centers found are used to assign the processes of the malware datasets to their appropriate cluster by selecting the cluster center with the lowest distance. For calculating the distance the Euclidean distance is used, see equation 3. The distance is calculated between two vectors x and y with the dimension i [17, p. 509]. In this case the dimensions are the eight variables used by the detection algorithm (listed in section VII). A distance matrix contains the distance of every combination of processes between both datasets.

$$\sqrt{\sum_{i} (x_i - y_i)^2} \tag{3}$$

The data is now prepared to be tested by our detection algorithm, which is discussed in the next section.

VII. DETECTION ALGORITHM

This section explains the algorithm used to compare the malware datasets against the clean datasets.

The algorithm can be described as follows. For every malware dataset these steps are performed:
1) Select a clean dataset.
2) For every depth present in the malware dataset, select the nodes in the malware dataset and the clean dataset at the selected depth, starting from depth 0.
3) At every depth, calculate a distance matrix using the Euclidean distance, equation 3, on the following variables:
   • filesystem
   • registry
   • process create
   • thread create
   • module load
   • ob
   • depth
   • fit cluster
4) From the calculated distance matrix, select the minimum distance present, assign this distance between the process from the clean dataset and the process from the malware dataset to the process from the malware dataset, and set the distance to NA in the distance matrix.
5) Repeat step 4 until all processes from the selected depth have a distance assigned.
6) Repeat steps 2 to 5 until all depths are done.
7) Repeat steps 1 to 6 until the malware dataset has been compared to all clean datasets.

Running this algorithm creates, for every malware dataset, four new dataframes containing a distance to a process in the matching clean dataset.

To mark a process as malicious we use a threshold value for the distance. If a process has a distance higher than the threshold value it is marked malicious. The threshold values used are discussed in the next section.

As the usage of programs can differ every day, comparing a dataset in which program A is used to a dataset where program A is not used will result in a high distance, and the process might therefore be marked as malicious. However, when comparing to a dataset in which program A is used, the processes will have a low distance. A malicious process, on the other hand, will have a high distance to every dataset.

Therefore a process in the malware dataset is only marked malicious if it is above the set threshold in all four comparison datasets. For example, if process i in the banking malware dataset has a distance above the threshold value for the comparison with all four clean datasets, it is marked malicious.

VIII. EVALUATION

To evaluate the presented algorithm we test whether the malicious processes are marked as malicious by using six different threshold values. The values used are the mean and the 75%, 80%, 85%, 90% and 95% quantiles of the distances found in the compared malware dataframe.

As we know which processes are malicious we can calculate the True Positive Rate (TPR), the False Positive Rate (FPR) and the Accuracy (ACC), see equations 4, 5 and 6. Here TP (true positives) are malicious processes marked as malicious and FN (false negatives) are malicious processes marked as benign. FP (false positives) are benign processes marked as malicious, and TN (true negatives) are the correctly marked benign processes.

$$\mathrm{TPR} = \frac{TP}{TP + FN} \tag{4}$$

$$\mathrm{FPR} = \frac{FP}{FP + TN} \tag{5}$$
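To make the procedure more concrete, the following Python sketch strings together the matching of section VII (steps 3 to 5 for a single depth), the rule that a process must exceed the threshold in every comparison, and the rates of equations 4 to 6. It is a minimal sketch under stated assumptions: the function names are hypothetical, the quantile uses linear interpolation (one common definition), and step 4 is read as clearing the whole matrix row of a malware process once it has a distance, which the text does not spell out.

```python
import math


def min_distances(malware_rows, clean_rows):
    """Steps 3 to 5 for one depth: give every malware process a distance.

    Each row holds the eight variables of one process. Assumption: after a
    malware process gets its distance, its whole matrix row is set to NA
    (None), while the clean process stays available for other matches.
    """
    if not clean_rows:
        raise ValueError("no clean processes at this depth")
    # Distance matrix (equation 3): one row per malware process.
    matrix = [[math.dist(m, c) for c in clean_rows] for m in malware_rows]
    assigned = [None] * len(malware_rows)
    while any(d is None for d in assigned):
        # Step 4: smallest distance still present in the matrix.
        dist, i = min((d, i) for i, row in enumerate(matrix)
                      for d in row if d is not None)
        assigned[i] = dist
        matrix[i] = [None] * len(clean_rows)
    return assigned


def quantile(values, q):
    """Quantile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def flag_malicious(distances_per_clean, q=0.75):
    """Mark a process only if it exceeds the threshold in every comparison.

    distances_per_clean: one list of distances per clean dataset, all
    ordered by the same malware process index.
    """
    flags = []
    for distances in distances_per_clean:
        threshold = quantile(distances, q)
        flags.append([d > threshold for d in distances])
    return [all(column) for column in zip(*flags)]


def rates(flagged, malicious):
    """TPR, FPR and ACC; needs at least one malicious and one benign process."""
    pairs = list(zip(flagged, malicious))
    tp = sum(f and m for f, m in pairs)
    fn = sum((not f) and m for f, m in pairs)
    fp = sum(f and (not m) for f, m in pairs)
    tn = sum((not f) and (not m) for f, m in pairs)
    return tp / (tp + fn), fp / (fp + tn), (tp + tn) / (tp + tn + fp + fn)
```

Under this reading of step 4 the loop yields, for each malware process, its smallest distance to any clean process at the same depth; a one-to-one matching that also removes the clean process would be a different, equally plausible reading.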
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{6}$$

In figure 2 the TPR, FPR and ACC are shown for all the datasets for every threshold type. The values for TPR range from 0 to 0.917 (on the banking malware), the FPR lies between 0.013 and 0.232, and the ACC ranges from 0.728 up to 0.958. The threshold type giving the best TPR is the 75% quantile; however, together with a rising TPR the FPR rises as well and the ACC goes down.

[Fig. 2: The FPR, TPR and ACC of the algorithm for all six threshold types. Panels: bank, rat1, rat2, zeus1, zeus2; x-axis: threshold type (mean, q0.75, q0.8, q0.85, q0.9, q0.95); y-axes: FPR, TPR, ACC.]

The presented algorithm was capable of detecting at least some of the malicious processes from the banking and RAT malware; however, it was incapable of detecting any of the malicious processes from the Zeus malware.

We analyzed the malicious processes from the Zeus malware to find out why they were not detected. The processes from the Zeus malware showed low values on the process activities. This might imply that the Zeus malware was only installed and started listening for a command from the command and control center, but did not receive any. Not receiving any command might be due to the fact that the malware datasets were collected over a very short period.

IX. CONCLUSION

In this paper we presented a novel anomaly detection method for malware based on combined data from process activities and process trees. We explained what kind of data was collected and which processing steps are taken. The evaluation of the detection algorithm showed that it was capable of detecting malicious processes from two of the three malware types. However, a higher TPR gives a higher FPR as well.

X. RECOMMENDATIONS

The set-up for our data collection was, due to security limitations, not ideal. For future research we would advise performing the same experiment with the data, clean and malware, collected on the same machine, thereby eliminating any inconsistencies in the programs installed and used.

In the conducted research only one value of k was tested; in future research, testing the impact of other values of k might render a higher success rate.

The data was normalized between zero and ten. However, other normalization methods, such as Z-score, might render different results.

In addition, while analyzing the data we concluded that some processes perform a set number of actions; however, different running times due to using different machines change the number of events per second. This creates different characteristics for a process whilst it is performing exactly the same actions. Therefore further research should be conducted on converting the number of events into a value that can be compared.

The number of malware samples tested was low; to get a better insight into the performance of the presented algorithm it should be tested on a larger number of malware samples. The problem here is that generating the data for the malware samples is quite time consuming.

REFERENCES

[1] "Security threat report 2014: Smarter, shadier, stealthier malware," Report, Sophos, 2013. [Online]. Available: http://www.sophos.com/en-us/medialibrary/PDFs/other/sophos-security-threat-report-2014.pdf
[2] "Five predictions for information security and cybercrime in 2014," http://www.theguardian.com/media-network/media-network-blog/2013/dec/10/predictions-information-security-cybercrime-2014, Dec. 2013, [Online; accessed 30-June-2014].
[3] "Enterprise bank accounts targeted in new malware attack," www.pcworld.com/article/2906056/enterpise-bank-accounts-targeted-in-new-malware-attack.html, April 2015, [Online; accessed 1-September-2015].
[4] "Hackers attack the energy industry with malware designed for snooping," http://fortune.com/2015/03/31/spies-malware-energy-email, March 2015, [Online; accessed 1-September-2015].
[5] "Hackers exploit flash in one of the largest malware attacks in recent history," https://bgr.com/2015/08/04/hackers-flash-yahoo-malware-attack/, August 2015, [Online; accessed 1-September-2015].
[6] "Number of new malware per year," http://www.av-test.org/en/statistics/malware/, [Online; accessed 15-January-2015].
[7] J. Song, H. Takakura, Y. Okabe, and K. Nakao, "Toward a more practical unsupervised anomaly detection system," Information Sciences, vol. 231, no. 0, pp. 4–14, 2013, Data Mining for Information Security. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0020025511004245
[8] P. Casas, J. Mazel, and P. Owezarski, "Unsupervised network intrusion detection systems: Detecting the unknown without knowledge," Computer Communications, vol. 35, no. 7, pp. 772–783, 2012. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0140366412000266
[9] R. Sommer and V. Paxson, "Outside the closed world: On using machine learning for network intrusion detection," in Security and Privacy (SP), 2010 IEEE Symposium on, May 2010, pp. 305–316.
[10] J. M. Harjinder Kaur, Gurpreet Singh, "A review of machine learning based anomaly detection techniques," International Journal of Computer Applications Technology and Research, vol. 2, no. 2, pp. 185–187, 2013.
[11] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM Comput. Surv., vol. 41, no. 3, pp. 15:1–15:58, Jul. 2009. [Online]. Available: http://doi.acm.org/10.1145/1541880.1541882
[12] C. Wagner, G. Wagener, R. State, and T. Engel, "Malware analysis with graph kernels and support vector machines," in Malicious and Unwanted Software (MALWARE), 2009 4th International Conference on. IEEE, 2009, pp. 63–68.
[13] G. Wagener, A. Dulaunoy, T. Engel et al., "Self adaptive high interaction honeypots driven by game theory," in Stabilization, Safety, and Security of Distributed Systems. Springer, 2009, pp. 741–755.
[14] K. Rieck, T. Holz, C. Willems, P. Düssel, and P. Laskov, "Learning and classification of malware behavior," in Detection of Intrusions and Malware, and Vulnerability Assessment, ser. Lecture Notes in Computer Science, D. Zamboni, Ed. Springer Berlin Heidelberg, 2008, vol. 5137, pp. 108–125.
[15] D.-K. Kang, D. Fuller, and V. Honavar, "Learning classifiers for misuse and anomaly detection using a bag of system calls representation," in Information Assurance Workshop, 2005. IAW '05. Proceedings from the Sixth Annual IEEE SMC, June 2005, pp. 118–125.
[16] J. A. Hartigan and M. A. Wong, "Algorithm AS 136: A k-means clustering algorithm," Journal of the Royal Statistical Society. Series C (Applied Statistics), vol. 28, no. 1, pp. 100–108, 1979. [Online]. Available: http://www.jstor.org/stable/2346830
[17] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. Springer, 2009.
