Detecting Malware Using Process Tree and Process Activity Data
Abstract—In the last few years malware has incurred more damage and has become more sophisticated. Current security solutions are still based on signature and known-behavior based detection. This renders them incapable of detecting new malware. In this paper we will present an anomaly detection method based on the combined data from process activities and process trees. We assume that processes from the same application show comparable process activities and that different applications show differences in the process activities. Using a distance measure on the process characteristics, depth and cluster of process a and process b, we can show which processes deviate from the known processes. The detection algorithm tries to minimize the distance for every process in the datasets. Evaluation showed that the presented algorithm could detect processes from two of the three malware samples used. The highest TPR gained was 0.917. For future research we would recommend using a data collection set-up in which all data is collected on one machine.

Index Terms—malware detection, process tree, process activities, anomaly detection.

I. INTRODUCTION

MALWARE is a huge problem in today's IT environment, and the predictions are that it will incur more damage and become more sophisticated [1], [2]. Headlines such as "Enterprise bank accounts targeted in new malware attack" [3], "Hackers attack the energy industry with malware designed for snooping" [4] and "Hackers exploit Flash in one of the largest malware attacks in recent history" [5] are not uncommon and are all from the first eight months of 2015. According to [6] the number of new malware samples discovered each year is rising significantly, from around 80 million samples in 2013 up to 143 million in 2014.

With these huge numbers of new malware samples released, it is difficult for anti-virus vendors to keep their protection against malware up to speed. The reason for this is that most security solutions are still based on a combination of signature-based detection and sandboxing. In signature-based detection, hashes of known malware files are used to detect them on a computer. Sandboxing runs the executable with strict policies on the host, such that the executable thinks it can execute all its commands. The behavior of the executable is then compared to known malicious behavior.

From the above it can be concluded that current security solutions are still based on detecting known malicious behavior. This creates a head start for the malware developers, and the damage is done before the security vendors can update their list of known malicious behavior.

To solve this problem, a detection method should be used that does not rely on known behavior and signatures of known malware. In the current scientific literature a lot is written on detecting malicious behavior on computers or networks. The main distinction is made between misuse detection and anomaly detection. Misuse detection is still based upon known malicious behavior and is not sufficient for detecting 0-day malware or exploits [7]–[9]. On the contrary, anomaly detection is more suitable for detecting 0-day malware and exploits. The anomaly detection model is based on known normal behavior and has the ability to detect deviations from this known normal behavior [7]–[10]. However, a disadvantage of anomaly detection is the higher number of false positives it generates [11] in comparison to misuse detection.

In this paper we will present a novel anomaly detection method for malware based on combined data from process activities and process trees.

In the next section related work will be discussed. After the related work we will introduce the assumption on which our presented algorithm is based. In section IV we will explain what data is collected. The following section will provide an overview of how the data is collected, after which in section VI the data processing is explained. Section VII will explain the novel detection algorithm, which will be evaluated in the following section. This paper will end with the conclusion in section IX and recommendations for future research in section X.

II. RELATED WORK

In [12] anomaly detection on Linux is done by using process related information, which includes the relationships among processes. This information is used to create a graph showing the relations between processes and processes, processes and programs, and processes and system calls. Each node in the tree consists of two parameters, nameproc and stadd. To be able to detect malicious behavior, the distance between the stadd of two nodes is calculated. Then the model is trained using a supervised SVM on a randomly selected 75% of the dataset and evaluated on the remaining 25%. This was repeated nine times, rendering an accuracy between 0.71 and 0.87.
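As an illustration of the evaluation protocol described for [12], the sketch below repeats a 75/25 train/test split nine times and reports the range of accuracies. It is only a stand-in: an RBF-kernel SVC on plain feature vectors is assumed here instead of the graph kernels of [12], and the feature and label arrays are hypothetical.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def repeated_holdout_accuracy(features, labels, repeats=9):
    # Train on a random 75% of the data, evaluate on the remaining 25%,
    # and repeat; [12] reports the resulting accuracy range (0.71 to 0.87).
    scores = []
    for seed in range(repeats):
        x_train, x_test, y_train, y_test = train_test_split(
            features, labels, test_size=0.25, random_state=seed)
        model = SVC(kernel="rbf").fit(x_train, y_train)
        scores.append(model.score(x_test, y_test))
    return min(scores), max(scores)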
The concept of process trees for malware detection on Linux machines is also used in [13]. However, instead of system calls, the command line options are recorded.

In [14] anomaly detection is proposed by using process properties from Windows systems. The process properties used are: changes to the Windows registry, changes to the filesystem, infection of running processes, network activity, and the starting and stopping of Windows services.

In this paper we will extend the above presented work by presenting a concept of anomaly detection for a single Windows host based on the use of process trees and process activity characteristics. The information of the process trees created will be combined with the activity characteristics of the processes.
This will be done for malware free datasets, as well as for datasets containing malware infections. Then we will construct three comparing methods to compare the constructed malware datasets against the clean datasets. The outcome of these comparisons will be used for detecting malicious behavior.

Therefore the collected malware datasets are much shorter, ranging from 20 to 40 minutes, in comparison to the clean datasets. This might be of influence on the outcome of the evaluation. The malware datasets will be compared against all the clean datasets. More on the evaluation set-up follows in section VIII.
Fig. 1: K-means plot (within groups sum of squares against the number of clusters, 2 to 14)

…cluster. A k-means clustering algorithm will be used for clustering, as this is a widely used clustering algorithm in anomaly and misuse detection [15]. To cluster the data, the k-means "Hartigan-Wong" algorithm is used on the clean datasets by minimizing the within-cluster sum of squares, see equation 2 [16], in which i = 1, 2, ..., n, with n defining the number of events, i.e. the number of processes in the dataset. j is defined as j = 1, 2, ..., p, in which p is the number of variables, so in our case six. x(k, j) is the mean of variable j over all elements in cluster k. The k used will be eight, as the within-group sum of squares does not decline much further when selecting a greater k, see figure 1.

Sum(k) = \sum_{i=0}^{n} \sum_{j=0}^{p} (x(i, j) - x(k, j))^2    (2)
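To make the clustering step concrete, the following sketch fits k-means for a range of k on the normalized activity variables and records the within-cluster sum of squares of equation 2, i.e. the curve of figure 1. It is a minimal illustration that assumes the clean data is available as a NumPy array with six columns; scikit-learn's KMeans (Lloyd's algorithm) stands in here for the Hartigan-Wong implementation used above.

import numpy as np
from sklearn.cluster import KMeans

def within_group_ss(activity_matrix, k_values=range(2, 15)):
    # One row per process, one column per normalized activity variable (six in our case).
    wss = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(activity_matrix)
        wss[k] = model.inertia_  # within-cluster sum of squares, equation 2
    return wss

# Hypothetical clean data: 500 processes, six variables normalized to the 0-10 range.
clean_activities = np.random.rand(500, 6) * 10
elbow_curve = within_group_ss(clean_activities)
# The elbow of this curve motivates k = 8; the fitted centers are reused later
# to assign the processes of the malware datasets to a cluster.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(clean_activities)
cluster_centers = kmeans.cluster_centers_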
The found cluster centers will be used to assign the processes of the malware datasets to their appropriate cluster by selecting the cluster center with the lowest distance. For calculating the distance the Euclidean distance will be used, see equation 3. The distance is calculated between two vectors x and y with dimension i [17, pp. 509]. In this case the dimensions are the eight variables mentioned above. A distance matrix contains the distance of every combination of processes between both datasets.

\sqrt{\sum_{i} (x_i - y_i)^2}    (3)

The data is now prepared to be tested by our detection algorithm, which will be discussed in the next section.
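The cluster assignment and the distance matrix of equation 3 can be expressed in a few lines, as in the sketch below. It assumes the cluster centers from the previous step and feature matrices holding the eight variables per process; SciPy's cdist computes the pairwise Euclidean distances. Names are illustrative.

import numpy as np
from scipy.spatial.distance import cdist

def assign_clusters(activity_matrix, cluster_centers):
    # Pick, for every process, the cluster center with the lowest Euclidean distance.
    distances = cdist(activity_matrix, cluster_centers, metric="euclidean")
    return np.argmin(distances, axis=1)

def distance_matrix(malware_features, clean_features):
    # Pairwise Euclidean distance (equation 3) between every malware/clean process pair;
    # each row holds the eight variables: the six activity counts, depth and fitted cluster.
    return cdist(malware_features, clean_features, metric="euclidean")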
VII. DETECTION ALGORITHM

This section will explain the algorithm used to compare the malware datasets against the clean datasets.

The algorithm can be described as follows. For every malware dataset these steps will be performed:

1) Select a clean dataset.
2) For every depth present in the malware dataset, select the nodes in the malware dataset and the clean dataset at the selected depth, starting from depth 0.
3) At every depth a distance matrix will be calculated using the Euclidean distance, equation 3, on the following variables:
   • filesystem
   • registry
   • process create
   • thread create
   • module load
   • ob
   • depth
   • fit cluster
4) Take the minimum distance present in the matrix, assign the distance between the process from the clean dataset and the process from the malware dataset to the process from the malware dataset, and set the distance to NA in the distance matrix.
5) Repeat step 4 until all processes from the selected depth have a distance assigned.
6) Repeat steps 2 to 5 until all depths are done.
7) Repeat steps 1 to 6 until the malware dataset is compared to all clean datasets.

Running this algorithm will create, for every malware dataset, four new dataframes containing a distance to a process in the matching clean dataset.

To mark a process as malicious we will use a threshold value for the distance. If a process has a distance higher than the threshold value it will be marked malicious. The used threshold values will be discussed in the next section.

As the usage of programs can differ every day, comparing a dataset in which program A is used to a dataset where program A is not used will result in a high distance, and the processes might therefore be marked as malicious. However, when comparing to a dataset in which program A is used, the processes will have a low distance. However, if a malicious process is present, it will have a high distance to every dataset.

Therefore a process in the malware dataset will only be marked malicious if it is above the set threshold in all four comparison datasets. For example, if process i in the banking malware dataset has a distance above the threshold value in the comparison with all four clean datasets, it will be marked malicious.
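To make the above concrete, a minimal Python sketch of steps 2 to 6 and of the threshold rule is given below. It assumes the processes of the malware and clean datasets are grouped by depth into dictionaries of feature matrices (the eight variables per row), and it uses a greedy matching that repeatedly takes the smallest remaining entry of the distance matrix and then masks that malware process's row; the description above does not state whether the matched clean process is also removed, so this is an assumption. Function names and data layout are illustrative, not the original implementation.

import numpy as np
from scipy.spatial.distance import cdist

def match_processes_at_depth(malware_features, clean_features):
    # Steps 3-5: build the distance matrix, repeatedly take the minimum distance
    # present, assign it to the malware process and set the used entries to NA.
    dist = cdist(malware_features, clean_features, metric="euclidean")
    assigned = np.full(len(malware_features), np.nan)
    while np.isnan(assigned).any():
        m, c = np.unravel_index(np.nanargmin(dist), dist.shape)
        assigned[m] = dist[m, c]
        dist[m, :] = np.nan  # assumption: only the malware process's row is masked
    return assigned

def compare_datasets(malware_by_depth, clean_by_depth):
    # Steps 2 and 6: run the matching for every depth present in the malware dataset.
    result = {}
    for depth, malware_features in malware_by_depth.items():
        clean_features = clean_by_depth.get(depth)
        if clean_features is None or len(clean_features) == 0:
            continue  # no clean processes at this depth to compare against
        result[depth] = match_processes_at_depth(malware_features, clean_features)
    return result

def mark_malicious(distances_per_clean_set, threshold):
    # A process is only marked malicious if its distance exceeds the threshold
    # in the comparison with all four clean datasets.
    return np.all(np.asarray(distances_per_clean_set) > threshold, axis=0)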
VIII. EVALUATION

To evaluate the presented algorithm we will test whether the malicious processes are marked as malicious by using six different threshold values. The values used are the mean and the 75%, 80%, 85%, 90% and 95% quantiles of the distances found in the compared malware dataframe.

As we know which processes are malicious, we can calculate the True Positive Rate, the False Positive Rate and the Accuracy, see equations 4, 5 and 6. Here TP (True Positive) are the malicious processes marked as malicious and FN (False Negative) are the malicious processes marked as benign. FP (False Positive) are the benign processes marked as malicious, and TN (True Negative) are the correctly marked benign processes.

TPR = TP / (TP + FN)    (4)

FPR = FP / (FP + TN)    (5)

ACC = (TP + TN) / (TP + TN + FP + FN)    (6)
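Given the assigned distances and the ground truth, the candidate thresholds and the three metrics can be computed as in the short sketch below (NumPy only; names are illustrative).

import numpy as np

def candidate_thresholds(distances):
    # Section VIII: the mean and the 0.75-0.95 quantiles of the distances
    # found in the compared malware dataframe.
    d = np.asarray(distances, dtype=float)
    thresholds = {"mean": np.nanmean(d)}
    for q in (0.75, 0.8, 0.85, 0.9, 0.95):
        thresholds[f"q{q}"] = np.nanquantile(d, q)
    return thresholds

def evaluate(marked_malicious, truly_malicious):
    # TPR, FPR and ACC as defined in equations 4, 5 and 6.
    marked = np.asarray(marked_malicious, dtype=bool)
    truth = np.asarray(truly_malicious, dtype=bool)
    tp = np.sum(marked & truth)
    fn = np.sum(~marked & truth)
    fp = np.sum(marked & ~truth)
    tn = np.sum(~marked & ~truth)
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    acc = (tp + tn) / (tp + tn + fp + fn)
    return tpr, fpr, acc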
Fig. 2: The FPR, TPR and ACC of the algorithm for all six threshold types (mean, q0.75, q0.8, q0.85, q0.9 and q0.95) on the bank, rat1, rat2, zeus1 and zeus2 datasets

In figure 2 the TPR, FPR and ACC are shown for all the datasets at every threshold type. The values for the TPR range from 0 to 0.917 (on the banking malware), the FPR is between 0.013 and 0.232 and the ACC ranges from 0.728 up to 0.958. The threshold type giving the best TPR is the 75% quantile; however, together with a rising TPR the FPR will rise as well and the ACC will go down.

The presented algorithm was capable of detecting at least some of the malicious processes from the banking and RAT malware; however, it was incapable of detecting any of the malicious processes from the Zeus malware.

We analyzed the malicious processes from the Zeus malware to find out why it was not detected. The processes from the Zeus malware showed low values on the process activities. This might imply that the Zeus malware was only installed and started listening for a command from the command and control center, but did not receive any. Not receiving any command might have to do with the fact that the collection period of the malware datasets was very short.
IX. CONCLUSION

In this paper we presented a novel anomaly detection method for malware based on combined data from process activities and process trees. We explained what kind of data was collected and which processing steps were taken. The evaluation of the detection algorithm showed that it was capable of detecting malicious processes from two of the three malware types. However, a higher TPR gives a higher FPR as well.

X. RECOMMENDATIONS

The set-up for our data collection was, due to security limitations, not ideal. For future research we would advise to perform the same experiment with the data, clean and malware, collected on the same machine, hereby eliminating any inconsistencies in the programs installed and used.

In the conducted research only one value of k was tested; in future research, testing the impact of other numbers of clusters might render a higher success rate.

The data was normalized between zero and ten. However, other normalization methods, such as the Z-score, might render different results.

In addition, while analyzing the data we concluded that some processes perform a set number of actions; however, different running times, caused by using different machines, will change the number of events per second. This will create different characteristics for a process whilst it is performing exactly the same actions. Therefore further research should be conducted on converting the number of events into a value that is comparable across datasets.
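To illustrate the two points above, the sketch below shows the 0 to 10 min-max scaling used in this research next to a Z-score alternative, together with a simple conversion of event counts into events per second so that traces with different running times become comparable. These helpers are illustrative only.

import numpy as np

def minmax_0_10(values):
    # The normalization used in this research: scale a feature to the 0-10 range.
    v = np.asarray(values, dtype=float)
    span = v.max() - v.min()
    return np.zeros_like(v) if span == 0 else (v - v.min()) / span * 10.0

def zscore(values):
    # Alternative suggested above: centre on the mean, scale by the standard deviation.
    v = np.asarray(values, dtype=float)
    std = v.std()
    return np.zeros_like(v) if std == 0 else (v - v.mean()) / std

def events_per_second(event_count, runtime_seconds):
    # One possible way to make activity counts comparable across running times.
    return event_count / float(runtime_seconds)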
The number of malware samples tested was low; to get a better insight in the performance of the presented algorithm, it should be tested on a larger number of malware samples. The problem hereby is that generating the data for the malware samples is quite time consuming.

REFERENCES

[1] "Security threat report 2014: Smarter, shadier, stealthier malware," Report, Sophos, 2013. [Online]. Available: http://www.sophos.com/en-us/medialibrary/PDFs/other/sophos-security-threat-report-2014.pdf
[2] "Five predictions for information security and cybercrime in 2014," http://www.theguardian.com/media-network/media-network-blog/2013/dec/10/predictions-information-security-cybercrime-2014, Dec. 2013, [Online; accessed 30-June-2014].
[3] "Enterprise bank accounts targeted in new malware attack," www.pcworld.com/article/2906056/enterpise-bank-accounts-targeted-in-new-malware-attack.html, April 2015, [Online; accessed 1-September-2015].
[4] "Hackers attack the energy industry with malware designed for snooping," http://fortune.com/2015/03/31/spies-malware-energy-email, March 2015, [Online; accessed 1-September-2015].
[5] "Hackers exploit flash in one of the largest malware attacks in recent history," https://bgr.com/2015/08/04/hackers-flash-yahoo-malware-attack/, August 2015, [Online; accessed 1-September-2015].
[6] "Number of new malware per year," http://www.av-test.org/en/statistics/malware/, [Online; accessed 15-January-2015].
[7] J. Song, H. Takakura, Y. Okabe, and K. Nakao, "Toward a more practical unsupervised anomaly detection system," Information Sciences, vol. 231, no. 0, pp. 4–14, 2013, Data Mining for Information Security. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0020025511004245
[8] P. Casas, J. Mazel, and P. Owezarski, "Unsupervised network intrusion detection systems: Detecting the unknown without knowledge," Computer Communications, vol. 35, no. 7, pp. 772–783, 2012. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0140366412000266
[9] R. Sommer and V. Paxson, "Outside the closed world: On using machine learning for network intrusion detection," in Security and Privacy (SP), 2010 IEEE Symposium on, May 2010, pp. 305–316.
[10] J. M. Harjinder Kaur, Gurpreet Singh, "A review of machine learning based anomaly detection techniques," International Journal of Computer Applications Technology and Research, vol. 2, no. 2, pp. 185–187, 2013.
[11] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM Comput. Surv., vol. 41, no. 3, pp. 15:1–15:58, Jul. 2009. [Online]. Available: http://doi.acm.org/10.1145/1541880.1541882
[12] C. Wagner, G. Wagener, R. State, and T. Engel, "Malware analysis with graph kernels and support vector machines," in Malicious and Unwanted Software (MALWARE), 2009 4th International Conference on. IEEE, 2009, pp. 63–68.
[13] G. Wagener, A. Dulaunoy, T. Engel et al., "Self adaptive high interaction honeypots driven by game theory," in Stabilization, Safety, and Security of Distributed Systems. Springer, 2009, pp. 741–755.