Defensedroid: A Modern Approach To Android Malware Detection
ABSTRACT
Lately, the problem of dangerous malware on devices has been spreading rapidly,
particularly repackaged Android malware. Although Android malware detection through
dynamic analysis gives a comprehensive view, we also need to relate an app's features
to the features needed to deliver the functionality of its category. Android is the
most popular openly available smartphone OS, and its permission-declaration access
control mechanism cannot detect the behavior of malware. Investigating such malware
presents distinctive challenges because of the limited resources available and the
restricted privileges granted to the user. However, APK files also present distinctive
opportunities in the data attached to every application. Our aim is to create an
efficient system that curbs the threat of Android malware by correctly detecting and
mitigating malicious APKs, combining permissions and API calls as features to
characterize malware and using machine learning techniques to automatically extract
patterns that differentiate benign from malicious apps. Through DefenseDroid, we
present a machine learning-based system for the detection of malware on Android
devices. DefenseDroid identifies, detects and categorizes apps and safeguards Android
mobile devices from malicious apps through a simple user interface, thus preventing
theft or misuse of the user's data. In our project, we propose a code-behavior,
signature-based malware detection framework using an LSTM algorithm, which can detect
malicious code and its variants effectively at runtime and extend the malware
characteristics database dynamically. Experimental results show that the approach has
a high detection rate and low rates of false positives and false negatives. Machine
learning techniques are the current methods for modeling patterns of static features
and dynamic behaviors of Android malware. Long Short-Term Memory (LSTM) networks are a
modified version of recurrent neural networks that makes it easier to retain past data
in memory.
Keywords: Android, Malware analysis, Machine Learning, Neural Networks, Long Short-
Term Memory.
INTRODUCTION
Malware is malicious software or code, possibly contained within an otherwise ordinary
application, that takes a range of hostile or intrusive forms such as viruses, worms,
spyware, Trojan horses, rootkits, and backdoors. A typical feature of malware is that it is
specifically designed to damage, disrupt, steal, or, in general, perform harmful or
illegitimate actions. Malware can infect virtually any computing system that runs user
programs (or applications), and the propagation and prevention of malware have been studied
thoroughly. For smartphone devices in particular, current solutions for detecting malware
on the mobile platform lag far behind the growing popularity of mobile applications. A
recent report has shown that there are more than 87 million mobile applications currently
available on the market. Research on Android malware detection follows three approaches:
static, dynamic, or hybrid. Malware is mainly distributed through markets operated by third
parties, but even the Google Play Store cannot guarantee that all of its listed
applications are threat free. The threats to users include phishing, banking Trojans,
spyware, bots, root exploits, SMS fraud, premium dialers, and fake installers. There have
also been reports of download-Trojan apps that download their malicious code after
installation, which means that these apps cannot easily be detected by Google's technology
during publication in the Google Play Store. DefenseDroid makes Android devices safer and
more secure.
This paper presents a cloud-based, LSTM-modeled active learning framework for smartphone
malware detection. Tests of our proposed system, which to some extent validate the
effectiveness of our strategy, show that the proposed method has good applicability and
scalability, can be applied to a range of popular malware monitoring scenarios, and can
detect unknown malware. Because malware often has little impact on system performance, its
potentially significant impact on the system's original capabilities may go unnoticed. In
summary, malware applications normally use the following three types of penetration
techniques for installation, activation, and running on the Android device.
Repackaging is one of the most common techniques malware developers use to install
malicious applications on the Android platform. These approaches typically start from
popular legitimate apps and misuse them as malware: the developers download popular apps,
disassemble them, add their own malicious code, and then reassemble and upload the new app
to official or alternative markets. The second technique relies on updates and changes the
way an application is built and maintained, which makes malware detection harder.
Developers may still use repackaging, but rather than enclosing the malicious code within
the app, they include an update component that downloads malicious code at runtime.
Downloading is the traditional attack technique: malware developers entice users to
download interesting and attractive apps.
REVIEW OF LITERATURE
The initial studies on smartphone malware focused chiefly on understanding the threats
and behaviors of emerging malware. There has been significant work on the problem of
detecting and eliminating malware on mobile devices. Many approaches monitor the usage of
applications and report abnormal behavior. Others monitor system calls and attempt to detect
unusual system call patterns. Other approaches use more traditional comparison with known
malware or other heuristics. Signature-based methods, introduced in the mid-90s, are
commonly used in malware detection. The main weakness of this kind of approach is its
inability to recognize updated and unseen malware. Rather than using predefined signatures
for malware detection, data mining and machine learning techniques provide an effective way
to dynamically extract malware patterns. For smartphone-based mobile platforms, recent years
have witnessed an increasing number of more sophisticated malware attacks, mainly resorting
to repackaging. Recent research systematically characterizes existing malware from various
aspects, including installation methods, activation mechanisms, and the nature of the
malicious payloads carried. Researchers have supported this analysis with four
representative mobile security software packages and more than 1200 collected malware
samples; their experiments show the weakness of current malware detection solutions and the
need to develop next-generation anti-mobile-malware solutions. One existing approach used
data mining on features generated from Windows executable API calls and achieved good
results on a very large dataset of about 35,000 portable executable files. Another
behavioral footprinting method also provides a dynamic approach to detecting
self-propagating malware. All these existing methods have advanced generic malware
detection, but inaccurate detection remains an open problem, and signatures therefore
require continuous and frequent changes. Here lies the research gap.
Seo, Gupta, Sallam, Bertino, & Yim (2014) proposed DroidAnalyzer, which uses
permissions, dangerous APIs and keywords associated with malicious behaviors to detect
potential malicious scripts in Android apps [8]. Arp, Spreitzenbarth, Hubner, Gascon, & Rieck
(2014) proposed DREBIN, a lightweight static analysis framework that extracts a set of
features from the app's AndroidManifest.xml (hardware components, requested permissions,
app components, and filtered intents) and disassembled code (restricted API calls, used
permissions, suspicious API calls, network addresses) to generate a joint vector space [9].
Wu, Mao, Wei, Lee, & Wu (2012) proposed DroidMat, which detects malware by analyzing
AndroidManifest.xml and tracing API calls [10]. Sanz, Santos, Laorden, Ugarte-Pedrero, &
Bringas (2012) proposed a machine learning method for automatic Android app categorization
and malware detection [12]. The following machine learning algorithms were applied: Decision
Trees (DT), K-Nearest Neighbor (KNN), Bayesian Networks (BN), Random Forest (RF) and Support
Vector Machines (SVM). Sahs & Khan (2012) built a system that uses the extracted permissions
and the control flow graphs from benign apps to train a one-class Support Vector Machine
(SVM) classifier [6].
Shabtai, Kanonov, Elovici, Glezer, & Weiss (2012) proposed an Artificial Neural Network
(ANN)-based system to detect unknown Android malware by analyzing the apps'
permissions and system calls [5]. Zhou, Wang, Zhou, & Jiang (2011) proposed DroidRanger
to detect known and unknown malware using two approaches: permission-based and
heuristics-based [7]. Ongtang, McLaughlin, Enck, & McDaniel (2009) proposed Secure
Application Interaction (SAINT), an infrastructure that controls the granting of permissions
to the app at install time [11]. SAINT also controls how the app uses the permissions at
runtime for interaction with other interfaces of other apps, PKI, and the Android system.
DESCRIPTION
ANALYSIS
The user selects the APK file to be tested, and it is sent to cloud storage. There the
applications are stored in sequence and wait to be processed. The APK is then sent to
the Cloud Engine. The features of the APK, such as permissions and API calls, are extracted
and sent to the machine learning model. The features of the APK are then analyzed, and
based on the findings a report is generated and sent back to the user.
DESIGN
IMPLEMENTATION METHODOLOGY
Data Collection
We required a dataset in binary vector format, which was not directly available online, so
we searched for and collected APKs to build a dataset for training our model. We came across
the CIC dataset from the University of New Brunswick [2], which has data from 42 unique
malware families. CIC also provides the MalDroid dataset [4], which has around 17,341 Android
samples. We also found the RmvDroid dataset on Zenodo [13], which contained 45 GB of
malicious APKs. We obtained another dataset, AndroZoo [1], containing 14,739,915 APKs, from
Université du Luxembourg. Lastly, we obtained a large amount of data from the VirusShare.com
dataset [3], which contains terabytes of APKs; its APKs were scanned with different
scanning software, and the dataset was prepared according to the results of the scans.
Feature Extraction
Collect all applications into separate folders, one for benign and one for suspicious
applications. Using the glob module in Python, create an array of files for
further processing. Analyze each application in the array using the pyaxmlparser [19] and
Androguard [18] frameworks.
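The folder-based collection step can be sketched as follows. This is a minimal sketch, assuming a hypothetical layout with benign/ and malicious/ subfolders under a dataset root; the function name collect_apks and the folder names are illustrative and are not taken from the DefenseDroid code. Per-file parsing with pyaxmlparser or Androguard would then run on each returned path.

```python
import glob
import os


def collect_apks(dataset_root, folders=("benign", "malicious")):
    """Return a list of (apk_path, label) pairs.

    Assumption: each class lives in its own subfolder of dataset_root,
    and the subfolder name doubles as the class label.
    """
    samples = []
    for label in folders:
        # Match only files ending in .apk inside the class folder.
        pattern = os.path.join(dataset_root, label, "*.apk")
        for path in sorted(glob.glob(pattern)):
            samples.append((path, label))
    return samples
```

Sorting the matches keeps the order of the array stable across runs, which helps when the rows of the resulting CSV need to be traced back to individual APKs.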
MalScan [20] is a framework that operates on API calls. It extracts the API calls from the
.dex files obtained by Androguard and saves them in the .gexf file format.
It then takes a set of sensitive API calls commonly used by malicious apps and generates a
CSV file in the vector format used previously. Separate CSVs are produced for the various
centrality types, such as degree, Katz and closeness.
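To illustrate how a centrality-based feature vector might be formed, the sketch below computes degree centrality over a toy call graph and keeps only the values for a list of sensitive API calls. It uses plain Python rather than MalScan itself; the edge list, the method names and the function names are hypothetical, and the normalization (degree divided by n - 1) is the common textbook definition, assumed here.

```python
def degree_centrality(edges):
    """Degree centrality for a directed call graph given as (caller, callee) pairs.

    Each node's in-degree plus out-degree is divided by (n - 1),
    where n is the number of nodes in the graph.
    """
    degree = {}
    for caller, callee in edges:
        degree[caller] = degree.get(caller, 0) + 1
        degree[callee] = degree.get(callee, 0) + 1
    n = len(degree)
    if n <= 1:
        return {node: 0.0 for node in degree}
    return {node: d / (n - 1) for node, d in degree.items()}


def centrality_vector(edges, sensitive_apis):
    """One value per sensitive API in a fixed order; 0.0 if the API is absent."""
    centrality = degree_centrality(edges)
    return [centrality.get(api, 0.0) for api in sensitive_apis]


# Toy call graph (hypothetical method names).
edges = [
    ("MainActivity.onCreate", "SmsManager.sendTextMessage"),
    ("Service.onStart", "SmsManager.sendTextMessage"),
    ("MainActivity.onCreate", "Service.onStart"),
]
sensitive = ["SmsManager.sendTextMessage", "TelephonyManager.getDeviceId"]
vector = centrality_vector(edges, sensitive)
```

Because the list of sensitive APIs is fixed, every app yields a vector of the same length, which is what allows the vectors to be stacked as rows of one CSV file per centrality type.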
Taking these four attributes into consideration, a program maps all attributes to a CSV
file and assigns a class to each application. Once the CSV files are generated, analyze
them for redundancy and, if any is found, eliminate the entire row. Another program
extracts all the permissions from these APK files. These permissions serve as attributes
in the dataset CSV file (a permission is marked 1 if it is present and 0 otherwise).
Each row of the CSV file thus forms an N-bit vector, and these vectors are the input to
the machine learning algorithm.
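The mapping from permissions to N-bit rows can be sketched as below. This is an illustrative sketch, not the authors' program: the function names, the three-permission vocabulary, and the convention that class 1 means malicious and 0 means benign are all assumptions made for the example.

```python
import csv
import io


def to_bit_vector(app_permissions, all_permissions):
    """Mark each known permission 1 if the app requests it, else 0."""
    requested = set(app_permissions)
    return [1 if perm in requested else 0 for perm in all_permissions]


def build_rows(apps, all_permissions):
    """Build (bit vector + class) rows, dropping rows that repeat an earlier one."""
    seen = set()
    rows = []
    for permissions, label in apps:
        row = tuple(to_bit_vector(permissions, all_permissions)) + (label,)
        if row not in seen:  # redundancy check: skip exact duplicate rows
            seen.add(row)
            rows.append(list(row))
    return rows


def write_dataset(rows, all_permissions, handle):
    """Write a header (permission names plus 'class') followed by the rows."""
    writer = csv.writer(handle)
    writer.writerow(list(all_permissions) + ["class"])
    writer.writerows(rows)


# Hypothetical vocabulary and apps; class 1 = malicious, 0 = benign (assumed).
all_permissions = [
    "android.permission.INTERNET",
    "android.permission.SEND_SMS",
    "android.permission.READ_CONTACTS",
]
apps = [
    (["android.permission.INTERNET"], 0),
    (["android.permission.INTERNET", "android.permission.SEND_SMS"], 1),
    (["android.permission.INTERNET"], 0),  # duplicate of the first row
]
rows = build_rows(apps, all_permissions)
buffer = io.StringIO()
write_dataset(rows, all_permissions, buffer)
```

Here the third app produces the same row as the first and is dropped, so two rows remain.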
Classification Model
Import the data from the CSV file using the pandas library. Using a train/test split, divide the
entire dataset in a ratio of 3:1 (75% of the data for training and 25% for testing). Design a model
with the inputs in mind. Select the sigmoid activation function, as the input/output is binary.
Analyze accuracy using a confusion matrix. This was the methodology we applied.
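The 75/25 split and the confusion-matrix evaluation can be sketched with the standard library alone. The sketch below does not reproduce the authors' pandas-based code; the function names, the fixed seed, and the label convention (1 = malicious, 0 = benign) are assumptions for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def confusion_matrix(y_true, y_pred):
    """Count outcomes for binary labels (1 = malicious, 0 = benign)."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 0:
            counts["tn"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1  # benign app flagged as malware
        else:
            counts["fn"] += 1  # malware missed by the model
    return counts


def accuracy(counts):
    """Fraction of predictions that were correct."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total if total else 0.0


train, test = train_test_split(list(range(8)))
counts = confusion_matrix([1, 1, 0, 0, 1, 0, 1, 0], [1, 0, 0, 0, 1, 1, 1, 0])
```

Keeping the false-positive and false-negative counts separate matters for malware detection, since a missed malicious app and a wrongly flagged benign app have different costs for the user.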
Our model was trained on a large dataset with detailed specifications, covering a wide variety of
applications and containing thousands of APKs. SVM is well known for classification, so it was the
first algorithm we tried, but during testing we found that the SVM approach had low
accuracy on real-time applications.
A Long Short-Term Memory (LSTM) network is an advanced RNN, a sequential network that allows
information to persist. LSTMs are explicitly designed to avoid long-term dependency problems.
LSTM can process not only single data points such as images, but also entire sequences of data such
as speech or video. Gated recurrent units (GRUs) are a gating mechanism in recurrent neural
networks, introduced in 2014 by Kyunghyun Cho et al. The GRU is like a long short-term memory
(LSTM) with a forget gate, but has fewer parameters than LSTM, as it lacks an output gate. GRU's
performance on certain tasks of polyphonic music modeling, speech signal modeling and natural
language processing was found to be similar to that of LSTM. We mainly compared these two
algorithms, GRU and LSTM, and found LSTM to have an edge in accuracy;
hence we chose it as the primary algorithm for our model.
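For readers unfamiliar with the gating mechanism, the sketch below implements one step of a standard LSTM cell with scalar input and scalar state. It is a textbook formulation in plain Python, not the trained DefenseDroid model; the parameter layout and function names are assumptions made for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step with scalar input, hidden state and cell state.

    params maps each gate name to (input weight, recurrent weight, bias):
    "f" forget gate, "i" input gate, "o" output gate, "g" candidate value.
    """
    pre = {name: w_x * x + w_h * h_prev + b
           for name, (w_x, w_h, b) in params.items()}
    f = sigmoid(pre["f"])      # how much of the old cell state to keep
    i = sigmoid(pre["i"])      # how much of the candidate to write
    o = sigmoid(pre["o"])      # how much of the cell state to expose
    g = math.tanh(pre["g"])    # candidate value
    c = f * c_prev + i * g     # new cell state
    h = o * math.tanh(c)       # new hidden state
    return h, c


def run_sequence(inputs, params):
    """Feed a sequence through the cell, starting from zero state."""
    h, c = 0.0, 0.0
    for x in inputs:
        h, c = lstm_step(x, h, c, params)
    return h
```

The cell state c is what lets information persist across time steps: when the forget gate stays close to 1, earlier inputs continue to influence later outputs. A GRU merges some of these gates, which is why it has fewer parameters.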
Workflow of DefenseDroid
On opening our app, the user is shown all the applications installed on the device, with options
to share an app, select an app for testing, and view app info. When the
user selects an app for testing, a dialog box asks whether he/she wants to perform a
deep scan. A deep scan analyzes API calls and therefore takes more time, so we give the user the
choice and proceed accordingly. Once the user sends the desired application for testing, a backup is
taken and uploaded through Firebase to the GCP bucket. In the next phase, the server
receives a request as a trigger message along with a token number that identifies the device from
which the app was sent, while the app is stored on the GCP platform with a UID (Unique
Identification) together with the user's scan choice as its name. The server then conducts the
analysis with our trained model, in which the LSTM algorithm plays the main role. Standard analysis
is based on three factors, namely permissions, receivers and services, and deep analysis is
based on API calls.
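The choice between standard and deep analysis can be sketched as a small dispatch on the scan request. Everything named here is hypothetical: the ScanRequest fields, the helper names, and the reading of the report name as "app_" followed by the UID are assumptions for illustration, and the text does not state whether a deep scan also includes the standard features.

```python
from dataclasses import dataclass

# Feature groups named in the text for each scan mode.
STANDARD_FEATURES = ("permissions", "receivers", "services")
DEEP_FEATURES = ("api_calls",)


@dataclass(frozen=True)
class ScanRequest:
    device_token: str  # identifies the device that sent the app
    uid: str           # unique identification assigned to the upload
    deep_scan: bool    # the user's choice from the dialog box


def feature_groups(request):
    """Pick the feature groups to analyze for this request."""
    return DEEP_FEATURES if request.deep_scan else STANDARD_FEATURES


def report_name(request):
    """Name of the generated PDF report (assumed 'app_' + UID pattern)."""
    return "app_" + request.uid
```

Carrying the device token inside the request keeps the report delivery step independent of the analysis: once the PDF is uploaded, the server only needs the stored token to notify the right device.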
When all this is done, a PDF report is generated from the analysis and uploaded to the GCP bucket
under the name ‘app_UID’, and this PDF is then sent to the device whose token
number was stored. The URL is requested from Firebase, the PDF is downloaded by the
download module, and a pop-up notification is triggered on the user’s device. This completes the
sequence of steps taken by our application, DefenseDroid.
FUTURE SCOPE
Since the deep scan analysis takes a while, we can optimize it by reworking the algorithm to
increase its efficiency and reduce its running time. We can also include an advanced analysis using
an Android sandbox environment. This would increase the scanning ability of DefenseDroid and would
test the apps in real time. At run time, some apps behave differently from how they are
supposed to; a sandbox would catch such apps and provide an added layer of assurance,
accuracy and protection. Another feature that could improve the working and
efficiency of DefenseDroid is multithreading, which would let users send
multiple or all apps for testing at once.
CONCLUSION
We have proposed using the permissions, receivers and services of
Android applications to detect malware and malicious code on the Android-based mobile
platform. Ours is a novel approach to distinguishing and detecting Android malware with
different intentions. It is effective in that it can distinguish variants of Android malware
with distinct purposes. The proposed framework extracts permissions from Android
applications and combines them with API calls to characterize each application as a
high-dimensional feature vector. By applying learning methods to the collected datasets, we
derive classification models that classify apps as benign or malware. Experiments on real-world
data demonstrate the good performance of the framework for malware detection.
APPENDIX
REFERENCES
[1] Androzoo Dataset, by Université du Luxembourg. https://androzoo.uni.lu/
[2] CIC AndMal 2017. https://www.unb.ca/cic/datasets/andmal2017.html
[3] VirusShare.com Dataset. https://virusshare.com/
[4] CIC MalDroid 2020. https://www.unb.ca/cic/datasets/maldroid-2020.html
[5] Shabtai, A., Kanonov, U., Elovici, Y. et al. “Andromaly”: a behavioral malware
detection framework for android devices. J Intell Inf Syst 38, 161–190 (2012).
https://doi.org/10.1007/s10844-010-0148-x
[6] Justin Sahs and Latifur Khan. 2012. A Machine Learning Approach to Android Malware
Detection. In Proceedings of the 2012 European Intelligence and Security Informatics
Conference (EISIC’12). IEEE Computer Society, USA, 141–147. DOI:
https://doi.org/10.1109/EISIC.2012.34
[7] Zhao M., Ge F., Zhang T., Yuan Z. (2011) AntiMalDroid: An Efficient SVM-Based
Malware Detection Framework for Android. In: Liu C, Chang J, Yang A (eds)
Information Computing and Applications. ICICA2011. Communications in Computer
and Information Science, vol 243. Springer, Berlin, Heidelberg.
https://doi.org/10.1007/978-3-642-27503-6_22
[8] Seung-Hyun Seo, Aditi Gupta, Asmaa Mohamed Sallam, Elisa Bertino, Kangbin Yim,
Detecting mobile malware threats to homeland security through static analysis, Journal
of Network and Computer Applications, Volume 38, 2014, Pages 43-53, ISSN 1084-8045,
https://doi.org/10.1016/j.jnca.2013.05.008
[9] Arp, Daniel & Spreitzenbarth, Michael & Hubner, Malte & Gascon, Hugo & Rieck,
Konrad. (2014). DREBIN: Effective and Explainable Detection of Android Malware in
Your Pocket. Symposium on Network and Distributed System Security (NDSS).
10.14722/ndss.2014.23247
[10] D. Wu, C. Mao, T. Wei, H. Lee and K. Wu, “DroidMat: Android Malware Detection
through Manifest and API Calls Tracing,” 2012 Seventh Asia Joint Conference on
Information Security, Tokyo, 2012, pp. 62-69, DOI: 10.1109/AsiaJCIS.2012.18.
https://ieeexplore.ieee.org/document/6298136/
[11] William Enck, Machigar Ongtang, and Patrick McDaniel. 2009. On lightweight mobile
phone application certification. In Proceedings of the 16th ACM conference on Computer
and communications security CCS’09. Association for Computing Machinery, New
York, NY, USA, 235–245. DOI: https://doi.org/10.1145/1653662.1653691
[12] Sanz, Borja & Santos, Igor & Laorden, Carlos & Ugarte Pedrero, Xabier & Bringas,
Pablo. (2012). On the Automatic Categorization of Android Applications.
10.1109/CCNC.2012.6181075. https://ieeexplore.ieee.org/document/6181075
[13] H. Wang, J. Si, H. Li and Y. Guo, “RmvDroid: Towards A Reliable Android Malware
Dataset with App Metadata,” 2019 IEEE/ACM 16th International Conference on Mining
Software Repositories (MSR), Montreal, QC, Canada, 2019, pp. 404-408, DOI:
10.1109/MSR.2019.00067. https://ieeexplore.ieee.org/document/8816783
[14] X. Li, J. Liu, Y. Huo, R. Zhang and Y. Yao, “An Android malware detection method
based on Android Manifest file,” 2016 4th International Conference on Cloud
Computing and Intelligence Systems (CCIS), Beijing, 2016, pp. 239-243, DOI:
10.1109/CCIS.2016.7790261. https://ieeexplore.ieee.org/document/7790261
[15] Mohammed K. Alzaylaee, Suleiman Y. Yerima, Sakir Sezer, DL-Droid: Deep Learning
Based Android Malware Detection Using Real Devices, Computers & Security (2019),
DOI: https://doi.org/10.1016/j.cose.2019.101663
[16] N. Peiravian and X. Zhu, “Machine Learning for Android Malware Detection Using
Permission and API Calls,” 2013 IEEE 25th International Conference on Tools with
Artificial Intelligence, Herndon, VA, 2013, pp. 300-305, DOI: 10.1109/ICTAI.2013.53.
https://ieeexplore.ieee.org/document/6735264
[17] The Complete Android Oreo Developer Course - Build 23 Apps! Created by Rob
Percival, Nick Walter.
https://www.udemy.com/course/the-complete-android-oreo-developer-course/
[18] Androguard: A Full Python Tool to play with Android files.
https://androguard.readthedocs.io/en/latest/
[19] Pyaxmlparser: Python3 Parser for Android XML file. https://pypi.org/project/pyaxmlparser/
[20] MalScan: Fast Market-Wide Mobile Malware Scanning by Social-Network Centrality
Analysis. Yueming Wu, XiaoDi Li, Deqing Zou, Wei Yang, Xin Zhang, Hai J
http://youngwei.com/pdf/MalScan.pdf