Android Based Malware Detection Technique Using Machine Learning Algorithms
Abstract: The threat landscape has expanded drastically with the increasing number of Android devices and applications. Android malware detection has grown in importance as a research area because the traditional detection technique, which relies mainly on signature scanning, cannot keep pace with rapidly changing malware. This paper assesses the efficiency of machine learning techniques in improving the detection and identification of Android malware. A comprehensive framework is proposed that uses machine learning (ML) algorithms to analyze and identify malicious apps. The framework works on a dataset with a wide range of features extracted from Android applications, including Permission Features, System Features, Security-related Features, Communication Features, Data Access Features, App Lifecycle Features, Device Control Features, and Miscellaneous Features. Model performance is assessed using accuracy, precision, recall, F1-score, and ROC AUC. The results show that most models achieve high detection rates with minimal false positives. Random Forest and XGBoost emerged as the top performers, achieving perfect scores of 1.0000 across all metrics.

Keywords: Android malware, machine learning, malware detection, feature selection

I. INTRODUCTION

The rapid rise of Android devices and applications has been accompanied by growing security issues. Android malware, in particular, has become a prevalent and growing problem. Android holds about 71.31% of the mobile market share as of April 2024, which makes the OS a prime and profitable target for cybercriminals [1]. Existing signature-based methods are insufficient against the greater complexity and novelty of threats, including obfuscation and zero-day attacks [2] [3]: attackers develop malware specifically to defeat signature-based protection, so that it fails to detect the malware before it spreads.

Traditional methods for detecting malware, such as signature-based approaches, are becoming less effective against growing threats because they rely on known malware signatures and may fail to detect updated malware. To address these challenges, researchers are using machine learning (ML) algorithms to identify and handle malware on Android devices. ML techniques offer a dynamic approach: by learning from data patterns, they improve detection accuracy and reduce false alarms.

Additionally, integrating feature selection and reduction methods has significantly boosted the efficiency of these ML models. Approaches such as forward selection, exhaustive subset selection, and community-based feature reduction have been explored to optimize the feature set and enhance the algorithm's detection capabilities. Recent innovations also leverage deep learning models such as Convolutional Neural Networks (CNNs), which have shown high accuracy in detecting malware through advanced pattern recognition and data abstraction.

Recent studies have shown that ML-based approaches are highly effective in detecting and learning patterns of malicious behavior to counter new types of malware. According to these studies, ML techniques detect such patterns with a relatively low rate of false positives and significantly outperform traditional approaches. These results align with our experimental findings, which suggest that ML-based models may be an efficient solution for real-time malware detection on Android.

Given the potential for malware to evolve, it will be key to embed these next-generation detection techniques into a lightweight, on-device framework so that end users receive continuous and non-intrusive protection. Future iterations of this research will also need to fine-tune these ML models for efficiency and scalability so that they can combat novel threats in the mobile security landscape. The work should also account for the resource limitations of mobile devices: trade-offs should be optimized to achieve the highest-performing detection while minimizing disruption to device performance and battery life.

In this paper, we present a comprehensive framework that unites multiple machine learning algorithms such as Logistic Regression (LR), Support Vector Machine (SVM), Decision Trees (DT), k-Nearest Neighbor (k-NN), Naive
Bayes (NB), Neural Networks (NN), Random Forest (RF), Gradient Boosting Machine (GBM), AdaBoost (AdaB), Bagging Classifier (BC), and XGBoost (XGB) to identify and classify Android malware. We leverage a diverse set of features extracted from Android applications, including permissions, system settings, security-related activities, communication functions, data access points, app lifecycle events, and device-controlling mechanisms, which collectively create a complete picture of the application's behavior and benefit the detection process.

II. LITERATURE REVIEW

Detecting Android malware has become a focus of research due to the rise in malicious apps targeting the Android platform. Some studies offer in-depth reviews of Android malware detection methods, including static, dynamic, machine learning, and deep learning approaches, and also highlight current challenges and suggest future research opportunities in the field.

Sharma and Kaul (2023) comprehensively reviewed the machine learning techniques employed for detecting malware in Android devices, highlighting advancements, challenges, and future directions in the field [4]. Bhattacharya and Goswami (2018) introduced a community-based feature selection technique that significantly boosted detection accuracy by identifying relevant features in large datasets [5]. Likewise, Mahindru et al. (2024) created the PermDroid framework, which combines a novel feature selection strategy with ML to identify Android malware. Their findings showed that effective feature selection could result in quicker detection systems [6].

Behavior-based detection approaches concentrate on monitoring the real-time behavior of applications to detect malicious activities. Vanjire and Lakshmi (2021) suggested a behavior-based malware detection system that uses ML to analyze app behavior, providing robust detection even against obfuscated malware [7]. Hybrid methods that merge static, dynamic, and machine learning strategies have garnered interest for their capacity to capitalize on the advantages of each approach. Chimeleze et al. (2022) introduced BFEDroid, a hybrid feature selection method that boosts detection accuracy by integrating multiple analysis methods [8].

Recent advancements in machine learning, such as deep learning (DL) and ensemble methods, have significantly enhanced Android malware detection. Zhou et al. (2024) proposed MTDroid, a Moving Target Defense-based Android malware detector that utilizes DL techniques to combat evasion tactics. This strategy adjusts to variations in malware behavior, increasing its resilience against attacks [9]. Additionally, Baghirov (2023) emphasized the significance of selecting the appropriate machine learning algorithm for detection tasks to optimize performance [10].

Cutting-edge strategies like incorporating co-existing features (Odat & Yaseen 2023) [11], analyzing texture features (Sharma & Rattan 2022) [12], and employing hybrid models that combine multiple machine learning algorithms (Neil et al., 2024) [13] are at the forefront of research on detecting Android malware.

Gawale's survey (2019) showcased the effectiveness of utilizing multiple classifiers in tandem for more precise malware detection. By combining the strengths of different classifiers, this approach mitigates the weaknesses of individual models, leading to more reliable detection outcomes [14]. Abdullah and Hadi (2024) investigated the effectiveness of various ML and deep learning models, particularly CNN-GRU, and tuned the model's parameters throughout the training phase by applying techniques like gradient descent and backpropagation [15].

Faez Mahdi and Jasim (2024) reviewed the potential of artificial intelligence techniques to recognize intricate malware patterns that conventional methods may miss. The authors underscored that AI-driven strategies can better adapt to malware variations, which makes them significant in contemporary malware detection [16]. Specialized deep learning models have been developed to tackle the complexities of identifying Android malware. Aamir et al. (2024) introduced the AMDDL model, a deep learning framework tailored for Android smartphones, and reported that it performs better than existing models in terms of Android malware detection [17]. Chandra et al. (2024) explored the application of ML techniques in detecting Trojans, highlighting the significance of feature selection and model training. Their research indicates that combining static and dynamic analysis with ML can enhance Trojan detection, reducing the risk of false negatives [18].

Furthermore, combining methods in hybrid models has proven to be beneficial for detecting Android malware more efficiently. Lee Yam et al. (2022) introduced a hybrid model that merges static and dynamic analysis with ML algorithms. This approach showed increased detection accuracy by leveraging the strengths of both static and dynamic methods, making it more resistant to malware attacks [19].

Another study by S. T et al. (2023) delved into implementing end-to-end ML systems for spotting malware. They highlighted the obstacles to deploying ML models in real-world settings, such as computational efficiency and resource constraints [20].

Atacak (2023) proposed an ensemble method that combines multiple ML classifiers using fuzzy logic to handle uncertainty and imprecision in data. Additionally, the fusion of machine learning techniques in hybrid models has proven effective in boosting the capabilities of malware detection systems [21]. Sai Sriraj et al. (2023) presented a hybrid approach that integrates various ML techniques. Their model demonstrated improved performance in terms of accuracy and detection speed, underscoring the potential of hybrid methods in modern malware detection [22].

The research in this field, exploring the detection of Android malware through ML methods, has produced a range of approaches and models that deliver accuracy and effectiveness. These research endeavors underscore the importance of enhancing
Authorized licensed use limited to: Beijing University of Chemical Technology. Downloaded on November 03,2024 at 11:13:02 UTC from IEEE Xplore. Restrictions apply.
2024 First International Conference on Pioneering Developments in Computer Science & Digital Technologies (IC2SDT)
detection strategies to combat the changing landscape of malware threats targeting Android devices.

The unique aspect of this study is its examination of a broad range of machine learning models for detecting Android malware. In contrast to previous studies that may have concentrated on a limited number of models or assessment criteria, this study encompasses a diverse array of models, such as Random Forest, XGBoost, Logistic Regression, Neural Networks, GBM, AdaBoost, Decision Trees, and k-NN. Additionally, the research includes an in-depth analysis of execution time and memory usage, offering insights into how these models can be applied in real-world scenarios.

III. METHODOLOGY

This study evaluates the effectiveness of different ML methodologies in detecting malware in Android applications. Fig. 1 shows the overall process carried out with the proposed malware analysis system. First, we collected a data set of Android applications from authentic open sources. Then, the data was preprocessed so that it is cleaned, standardized, and well-suited for training the ML models. After that, the processed dataset was split into two parts: training and testing. The training part is used to train the ML model and the testing part is used to test the model on unseen malware. By tuning and evaluating the model, we ensured that it can successfully distinguish between benign and malicious applications. Further work might explore the introduction of user input or actual use cases in the implementation of the malware scanner.

For this research, the “Android Malware Detection Dataset” was obtained from the Kaggle open-source repository [23]. It is a large-scale dataset that can be used to detect and analyze Android malware. The dataset comprises a large number of features extracted from Android applications: features related to permissions requested by apps, such as access to location, camera, and contacts; features associated with system functions and controls, such as hardware access and system settings; security-related features covering behaviors associated with authentication and encryption; communication features covering behaviors associated with SMS and network connections; data access and manipulation features covering storage and databases; app lifecycle features covering installation and updates; device control features, such as audio settings and display management; and other features such as system logs and system events.

The dataset contains 4464 instances of benign and malware Android apps with a total of 328 features extracted from Android applications. The dataset makes it possible to implement and enhance malware detection methods that can be a valuable addition to existing mobile security measures on the Android platform. The dataset is available in .csv format for further processing.

2. Data Preprocessing:

Data preprocessing prepares the data for the training and testing tasks in ML-based Android malware detection. The “Android Malware Detection Dataset” obtained from Kaggle has a total of 4464 instances with 328 features extracted from Android applications, of which 1931 instances are labeled as malicious and 2533 as benign. No identical columns, null or missing values, or empty rows were found in the dataset. To address the class imbalance in the dataset, we used a hybrid resampling approach to improve the performance of the ML techniques. In particular, the SMOTE-ENN resampling approach was employed to obtain a balanced dataset. After applying this method, we obtained a dataset with 2,229 instances labeled as malicious and 2,072 instances labeled as benign. The class distribution of the original dataset and of the dataset after SMOTE-ENN is illustrated in Fig. 2.
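The SMOTE-ENN step can be illustrated with a small, pure-Python sketch. It is a simplified illustration under stated assumptions, not the implementation used in this study: the paper does not name the library or parameters, and the function names, the toy data, `k=3`, and the majority-vote cleaning rule are all hypothetical choices. In practice a library such as imbalanced-learn (`imblearn.combine.SMOTEENN`) is typically used, and it may apply different neighbour counts and voting rules.

```python
import math
import random
from collections import Counter


def nearest_indices(point, pool, k):
    """Indices of the k rows in `pool` closest to `point` (Euclidean distance)."""
    order = sorted(range(len(pool)), key=lambda i: math.dist(point, pool[i]))
    return order[:k]


def smote(X, y, minority, n_new, k, rng):
    """Append n_new synthetic minority rows interpolated between neighbours."""
    rows = [x for x, label in zip(X, y) if label == minority]
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(rows)
        # The closest row is `base` itself (distance 0), so drop the first hit.
        neighbours = nearest_indices(base, rows, k + 1)[1:]
        other = rows[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return X + synthetic, y + [minority] * n_new


def edited_nearest_neighbours(X, y, k):
    """Keep only rows whose label matches the majority label of their k neighbours."""
    kept = []
    for i, row in enumerate(X):
        others = [j for j in range(len(X)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(row, X[j]))[:k]
        majority = Counter(y[j] for j in nearest).most_common(1)[0][0]
        if majority == y[i]:
            kept.append(i)
    return [X[i] for i in kept], [y[i] for i in kept]


def smote_enn(X, y, k=3, seed=0):
    """Oversample the minority class to parity, then clean with ENN."""
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    n_new = max(counts.values()) - counts[minority]
    X_over, y_over = smote(X, y, minority, n_new, k, random.Random(seed))
    return edited_nearest_neighbours(X_over, y_over, k)


# Toy data (hypothetical): 8 benign rows (label 0) near the origin,
# 3 malicious rows (label 1) far away from them.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [0.2, 0.8], [0.8, 0.2], [0.5, 0],
     [5, 5], [5, 6], [6, 5]]
y = [0] * 8 + [1] * 3
X_res, y_res = smote_enn(X, y)
print(Counter(y), "->", Counter(y_res))
```

Because the cleaning step removes rows from both classes, the resampled classes need not be exactly equal in size, which is consistent with the 2,229 and 2,072 instances reported above. The sketch assumes at least two minority rows and numeric feature vectors.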
The key features were selected using the Extra Trees classifier. The most notable features are shown in Fig. 3.

The whole dataset after feature selection was divided into two parts: 80% of the data set was used for training the models and the other 20% for validation. Afterward, multiple ML techniques, namely LR, SVM, DT, k-NN, NB, NN, RF, GBM, AdaBoost, Bagging Classifier, and XGBoost, were trained on the data. The performance of each model was evaluated on unseen test data using the standard evaluation metrics, and the effectiveness of each trained and tested model was assessed on that basis. After evaluation, the model can be used to analyze new applications and predict whether they are malicious or benign.

To establish this framework, we used a system with 16 GB of RAM, an 8th Generation Intel i5 processor, and 2 GB of graphics card memory. The Python code uses well-known libraries such as pandas, seaborn, numpy, matplotlib, scipy, and sklearn.

IV. RESULT AND DISCUSSION

TABLE I summarizes the performance of different machine learning models on the provided dataset. The performance is measured using the following evaluation metrics: Accuracy, Precision, Recall, F1 Score, ROC AUC, Execution Time, and Memory Usage. The ROC curves illustrating the performance of these ML techniques are depicted in Fig. 4.

TABLE I

Model  Accuracy  Precision  Recall  F1 Score  ROC AUC  Execution Time (s)  Memory Usage (MB)
LR     0.9977    0.9977     0.9977  0.9977    0.9977   0.1409              680.2773
SVM    0.9954    0.9954     0.9954  0.9954    0.9954   0.4314              680.2773
DT     0.9884    0.9884     0.9884  0.9884    0.9885   0.0283              680.2773
k-NN   0.9803    0.9805     0.9803  0.9802    0.9801   0.0965              680.2773
RF     1.0000    1.0000     1.0000  1.0000    1.0000   0.5299              680.2773

Fig. 4. ROC Curve Comparison for Different ML Algorithms

In this research, we examined various models for Android malware detection using machine learning and recorded their results through several performance metrics. Random Forest and XGBoost emerged as the top performers, achieving perfect scores of 1.0000 across all metrics. Other strong contenders, including Logistic Regression, Neural Networks, Gradient Boosting Machines (GBM), and AdaBoost, achieved accuracy, precision, recall, F1 score, and ROC AUC values of 0.9977, demonstrating a reliable ability to distinguish between classes. Support Vector Machines (SVM) and Bagging Classifiers performed well but slightly below the models mentioned earlier. Naive Bayes showed weaker performance, with an accuracy of 0.9443 and a ROC AUC of 0.9450. The execution times varied greatly across the models, with Decision Trees and Naive Bayes being the fastest at 0.0283 seconds and 0.0145 seconds respectively, making them suitable for applications requiring quick results. Although Random Forest and XGBoost had longer execution times of 0.5299 seconds and 2.0366 seconds respectively, their accuracy made them strong choices for scenarios where performance is paramount. Neural Networks, despite achieving high accuracy, had the longest execution time at 3.3911 seconds. All models showed the same memory usage, with each using 680.2773 MB.
execution times, are ideal for scenarios where accuracy and predictive performance are paramount.

Furthermore, the research indicates consistent memory usage among all models, ensuring that memory constraints do not significantly impact model selection. This consistency in resource utilization enables these models to be adaptable across environments, including mobile devices and edge computing platforms. One of the main challenges identified in this study is the trade-off between execution time and accuracy. While models like Neural Networks provide high accuracy, their longer execution times may not be suitable for all applications.

Future research could focus on optimizing the hyperparameters of the top-performing models to achieve higher accuracy together with lower execution times. Different types of ensemble learning could enhance the prediction performance and robustness of models on unseen data. Research on scalability and deployment could determine the models' performance on distributed systems or edge devices. The interpretability of model predictions, especially for complex models like Neural Networks and GBM, also deserves study. Testing these models on other domain-specific datasets would assess their generalizability. Research on model vulnerability to adversarial attacks would increase confidence in deployment and use. Finally, hybrid models could be developed that combine the algorithms' strengths while offsetting their weaknesses, to increase usability and relevance.

REFERENCES

[1] https://gs.statcounter.com/os-market-share/mobile/worldwide.
[2] F. Nawshin, R. Gad, D. Unal, A. K. Al-Ali, and P. N. Suganthan, “Malware detection for mobile computing using secure and privacy-preserving machine learning approaches: A comprehensive survey,” Computers and Electrical Engineering, vol. 117. Elsevier BV, p. 109233, Jul. 2024. doi: 10.1016/j.compeleceng.2024.109233.
[3] F. A. Aboaoja, A. Zainal, F. A. Ghaleb, B. A. S. Al-rimy, T. A. E. Eisa, and A. A. H. Elnour, “Malware Detection Issues, Challenges, and Future Directions: A Survey,” Applied Sciences, vol. 12, no. 17. MDPI AG, p. 8482, Aug. 25, 2022. doi: 10.3390/app12178482.
[4] M. Sharma and A. Kaul, “A review of detecting malware in android devices based on machine learning techniques,” Expert Systems, vol. 41, no. 1. Wiley, Oct. 24, 2023. doi: 10.1111/exsy.13482.
[5] A. Bhattacharya and R. T. Goswami, “Community Based Feature Selection Method for Detection of Android Malware,” Journal of Global Information Management, vol. 26, no. 3. IGI Global, pp. 54–77, Jul. 01, 2018. doi: 10.4018/jgim.2018070105.
[6] A. Mahindru et al., “PermDroid a framework developed using proposed feature selection approach and machine learning techniques for Android malware detection,” Scientific Reports, vol. 14, no. 1. Springer Science and Business Media LLC, May 10, 2024. doi: 10.1038/s41598-024-60982-y.
[7] S. Vanjire and M. Lakshmi, “Behavior-Based Malware Detection System Approach For Mobile Security Using Machine Learning,” 2021 International Conference on Artificial Intelligence and Machine Vision (AIMV). IEEE, Sep. 24, 2021. doi: 10.1109/aimv53313.2021.9671009.
[8] C. Chimeleze et al., “BFEDroid: A Feature Selection Technique to Detect Malware in Android Apps Using Machine Learning,” Security and Communication Networks, vol. 2022. Hindawi Limited, pp. 1–24, Oct. 11, 2022. doi: 10.1155/2022/5339926.
[9] Y. Zhou, G. Cheng, S. Yu, Z. Chen, and Y. Hu, “MTDroid: A Moving Target Defense-Based Android Malware Detector Against Evasion Attacks,” IEEE Transactions on Information Forensics and Security, vol. 19. Institute of Electrical and Electronics Engineers (IEEE), pp. 6377–6392, 2024. doi: 10.1109/tifs.2024.3414339.
[10] E. Baghirov, “Evaluating the Performance of Different Machine Learning Algorithms for Android Malware Detection,” 2023 5th International Conference on Problems of Cybernetics and Informatics (PCI). IEEE, Aug. 28, 2023. doi: 10.1109/pci60110.2023.10326006.
[11] E. Odat and Q. M. Yaseen, “A Novel Machine Learning Approach for Android Malware Detection Based on the Co-Existence of Features,” IEEE Access, vol. 11. Institute of Electrical and Electronics Engineers (IEEE), pp. 15471–15484, 2023. doi: 10.1109/access.2023.3244656.
[12] T. Sharma and D. Rattan, “Visualizing Android Malicious Applications Using Texture Features,” International Journal of Image and Graphics, vol. 23, no. 06. World Scientific Pub Co Pte Ltd, Aug. 29, 2022. doi: 10.1142/s0219467823500523.
[13] A. M. Neil, E. Shabaan, M. El Qout, and K. Emara, “Machine Learning Based Approaches For Android Malware Detection using Hybrid Feature Analysis,” 2024 6th International Conference on Computing and Informatics (ICCI). IEEE, Mar. 06, 2024. doi: 10.1109/icci61671.2024.10485163.
[14] M. R. Gawale, “Survey on Android Malware Detection Using Multilevel Classifier Fusion,” International Journal for Research in Applied Science and Engineering Technology, vol. 7, no. 3. International Journal for Research in Applied Science and Engineering Technology (IJRASET), pp. 1858–1862, Mar. 31, 2019. doi: 10.22214/ijraset.2019.3346.
[15] K. M. Abdullah and A. A. Hadi, “Exploring the Effectiveness of Machine and Deep Learning Techniques for Android Malware Detection,” Journal of Image Processing and Intelligent Remote Sensing, no. 42. HM Publishers, pp. 1–10, Feb. 01, 2024. doi: 10.55529/jipirs.42.1.10.
[16] M. Faez mahdi and S. Saadoon Jasim, “Mobile based Malware Detection using Artificial Intelligence Techniques a review,” Journal of Al-Qadisiyah for Computer Science and Mathematics, vol. 16, no. 1. Journal of Al-Qadisiyah for Computer Science and Mathematics, Mar. 30, 2024. doi: 10.29304/jqcsm.2024.16.11439.
[17] M. Aamir et al., “AMDDLmodel: Android smartphones malware detection using deep learning model,” PLOS ONE, vol. 19, no. 1. Public Library of Science (PLoS), p. e0296722, Jan. 19, 2024. doi: 10.1371/journal.pone.0296722.
[18] Prof. Aravinda Thejas Chandra, Ms. Sindhu R, Ms. Spoorthi H, Ms. Prerana R P, and Ms. V Bhavana, “Detection of Malware Trojans in Software using Machine Learning,” International Journal of Advanced Research in Science, Communication and Technology. Naksh Solutions, pp. 514–519, May 06, 2024. doi: 10.48175/ijarsct-18083.
[19] A. K. T. Lee Yam, J. M. R. Ballesta, J. A. H. Lanceta, M. K. T. Mogol, and R. Labanan, “Hybrid Android Malware Detection Model using Machine learning Algorithms,” 2022 2nd International Conference in Information and Computing Research (iCORE). IEEE, Dec. 2022. doi: 10.1109/icore58172.2022.00032.
[20] S. T, J. Subramanian, S. R, G. Subramanian, Y. Sreekar, and V. Bansal, “End-to-End Implementation of Malware Detection Using Machine Learning,” Proceedings of the 1st International Conference on Artificial Intelligence, Communication, IoT, Data Engineering and Security, IACIDS 2023, 23-25 November 2023, Lavasa, Pune, India. EAI, 2024. doi: 10.4108/eai.23-11-2023.2343241.
[21] İ. Atacak, “An Ensemble Approach Based on Fuzzy Logic Using Machine Learning Classifiers for Android Malware Detection,” Applied Sciences, vol. 13, no. 3. MDPI AG, p. 1484, Jan. 23, 2023. doi: 10.3390/app13031484.
[22] M. V. H Sai Sriraj, K. K. Thambi, B. S. Venkat Teja, and V. A. Woonna, “Malware Detection in Android Based Devices by a Hybrid Approach Using Machine Learning Techniques,” 2023 3rd Asian Conference on Innovation in Technology (ASIANCON). IEEE, Aug. 25, 2023. doi: 10.1109/asiancon58793.2023.10269963.
[23] Danny Revaldo, “Android Malware Detection Dataset.” Kaggle, 2024. doi: 10.34740/KAGGLE/DSV/7689244.