-
Fast & Furious: Modelling Malware Detection as Evolving Data Streams
Authors:
Fabrício Ceschin,
Marcus Botacin,
Heitor Murilo Gomes,
Felipe Pinagé,
Luiz S. Oliveira,
André Grégio
Abstract:
Malware is a major threat to computer systems and imposes many challenges to cyber security. Targeted threats, such as ransomware, cause millions of dollars in losses every year. The constant increase of malware infections has been motivating popular antiviruses (AVs) to develop dedicated detection strategies, which include meticulously crafted machine learning (ML) pipelines. However, malware dev…
▽ More
Malware is a major threat to computer systems and imposes many challenges to cyber security. Targeted threats, such as ransomware, cause millions of dollars in losses every year. The constant increase of malware infections has been motivating popular antiviruses (AVs) to develop dedicated detection strategies, which include meticulously crafted machine learning (ML) pipelines. However, malware developers unceasingly change their samples' features to bypass detection. This constant evolution of malware samples causes changes to the data distribution (i.e., concept drifts) that directly affect ML model detection rates, something not considered in the majority of the literature work. In this work, we evaluate the impact of concept drift on malware classifiers for two Android datasets: DREBIN (about 130K apps) and a subset of AndroZoo (about 285K apps). We used these datasets to train an Adaptive Random Forest (ARF) classifier, as well as a Stochastic Gradient Descent (SGD) classifier. We also ordered all datasets samples using their VirusTotal submission timestamp and then extracted features from their textual attributes using two algorithms (Word2Vec and TF-IDF). Then, we conducted experiments comparing both feature extractors, classifiers, as well as four drift detectors (DDM, EDDM, ADWIN, and KSWIN) to determine the best approach for real environments. Finally, we compare some possible approaches to mitigate concept drift and propose a novel data stream pipeline that updates both the classifier and the feature extractor. To do so, we conducted a longitudinal evaluation by (i) classifying malware samples collected over nine years (2009-2018), (ii) reviewing concept drift detection algorithms to attest its pervasiveness, (iii) comparing distinct ML approaches to mitigate the issue, and (iv) proposing an ML data stream pipeline that outperformed literature approaches.
△ Less
Submitted 15 August, 2022; v1 submitted 24 May, 2022;
originally announced May 2022.
-
Malware MultiVerse: From Automatic Logic Bomb Identification to Automatic Patching and Tracing
Authors:
Marcus Botacin,
André Grégio
Abstract:
Malware and other suspicious software often hide behaviors and components behind logic bombs and context-sensitive execution paths. Uncovering these is essential to react against modern threats, but current solutions are not ready to detect these paths in a completely automated manner. To bridge this gap, we propose the Malware Multiverse (MalVerse), a solution able to inspect multiple execution p…
▽ More
Malware and other suspicious software often hide behaviors and components behind logic bombs and context-sensitive execution paths. Uncovering these is essential to react against modern threats, but current solutions are not ready to detect these paths in a completely automated manner. To bridge this gap, we propose the Malware Multiverse (MalVerse), a solution able to inspect multiple execution paths via symbolic execution aiming to discover function inputs and returns that trigger malicious behaviors. MalVerse automatically patches the context-sensitive functions with the identified symbolic values to allow the software execution in a traditional sandbox. We implemented MalVerse on top of angr and evaluated it with a set of Linux and Windows evasive samples. We found that MalVerse was able to generate automatic patches for the most common evasion techniques (e.g., ptrace checks).
△ Less
Submitted 13 September, 2021;
originally announced September 2021.
-
A [in]Segurança dos Sistemas Governamentais Brasileiros: Um Estudo de Caso em Sistemas Web e Redes Abertas
Authors:
Marcus Botacin,
André Grégio
Abstract:
Whereas the world relies on computer systems for providing public services, there is a lack of academic work that systematically assess the security of government systems. To partially fill this gap, we conducted a security evaluation of publicly available systems from public institutions. We revisited OWASP top-10 and identified multiple vulnerabilities in deployed services by scanning public gov…
▽ More
Whereas the world relies on computer systems for providing public services, there is a lack of academic work that systematically assess the security of government systems. To partially fill this gap, we conducted a security evaluation of publicly available systems from public institutions. We revisited OWASP top-10 and identified multiple vulnerabilities in deployed services by scanning public government networks. Overall, the unprotected services found have inadequate security level, which must be properly discussed and addressed.
△ Less
Submitted 13 September, 2021;
originally announced September 2021.
-
Machine Learning (In) Security: A Stream of Problems
Authors:
Fabrício Ceschin,
Marcus Botacin,
Albert Bifet,
Bernhard Pfahringer,
Luiz S. Oliveira,
Heitor Murilo Gomes,
André Grégio
Abstract:
Machine Learning (ML) has been widely applied to cybersecurity and is considered state-of-the-art for solving many of the open issues in that field. However, it is very difficult to evaluate how good the produced solutions are, since the challenges faced in security may not appear in other areas. One of these challenges is the concept drift, which increases the existing arms race between attackers…
▽ More
Machine Learning (ML) has been widely applied to cybersecurity and is considered state-of-the-art for solving many of the open issues in that field. However, it is very difficult to evaluate how good the produced solutions are, since the challenges faced in security may not appear in other areas. One of these challenges is the concept drift, which increases the existing arms race between attackers and defenders: malicious actors can always create novel threats to overcome the defense solutions, which may not consider them in some approaches. Due to this, it is essential to know how to properly build and evaluate an ML-based security solution. In this paper, we identify, detail, and discuss the main challenges in the correct application of ML techniques to cybersecurity data. We evaluate how concept drift, evolution, delayed labels, and adversarial ML impact the existing solutions. Moreover, we address how issues related to data collection affect the quality of the results presented in the security literature, showing that new strategies are needed to improve current solutions. Finally, we present how existing solutions may fail under certain circumstances, and propose mitigations to them, presenting a novel checklist to help the development of future ML solutions for cybersecurity.
△ Less
Submitted 4 September, 2023; v1 submitted 29 October, 2020;
originally announced October 2020.
-
A Praise for Defensive Programming: Leveraging Uncertainty for Effective Malware Mitigation
Authors:
Ruimin Sun,
Marcus Botacin,
Nikolaos Sapountzis,
Xiaoyong Yuan,
Matt Bishop,
Donald E Porter,
Xiaolin Li,
Andre Gregio,
Daniela Oliveira
Abstract:
A promising avenue for improving the effectiveness of behavioral-based malware detectors would be to combine fast traditional machine learning detectors with high-accuracy, but time-consuming deep learning models. The main idea would be to place software receiving borderline classifications by traditional machine learning methods in an environment where uncertainty is added, while software is anal…
▽ More
A promising avenue for improving the effectiveness of behavioral-based malware detectors would be to combine fast traditional machine learning detectors with high-accuracy, but time-consuming deep learning models. The main idea would be to place software receiving borderline classifications by traditional machine learning methods in an environment where uncertainty is added, while software is analyzed by more time-consuming deep learning models. The goal of uncertainty would be to rate-limit actions of potential malware during the time consuming deep analysis. In this paper, we present a detailed description of the analysis and implementation of CHAMELEON, a framework for realizing this uncertain environment for Linux. CHAMELEON offers two environments for software: (i) standard - for any software identified as benign by conventional machine learning methods and (ii) uncertain - for software receiving borderline classifications when analyzed by these conventional machine learning methods. The uncertain environment adds obstacles to software execution through random perturbations applied probabilistically on selected system calls. We evaluated CHAMELEON with 113 applications and 100 malware samples for Linux. Our results showed that at threshold 10%, intrusive and non-intrusive strategies caused approximately 65% of malware to fail accomplishing their tasks, while approximately 30% of the analyzed benign software to meet with various levels of disruption. With a dynamic, per-system call threshold, CHAMELEON caused 92% of the malware to fail, and only 10% of the benign software to be disrupted. We also found that I/O-bound software was three times more affected by uncertainty than CPU-bound software. Further, we analyzed the logs of software crashed with non-intrusive strategies, and found that some crashes are due to the software bugs.
△ Less
Submitted 12 June, 2020; v1 submitted 7 February, 2018;
originally announced February 2018.