-
Fast & Furious: Modelling Malware Detection as Evolving Data Streams
Authors:
Fabrício Ceschin,
Marcus Botacin,
Heitor Murilo Gomes,
Felipe Pinagé,
Luiz S. Oliveira,
André Grégio
Abstract:
Malware is a major threat to computer systems and imposes many challenges to cyber security. Targeted threats, such as ransomware, cause millions of dollars in losses every year. The constant increase of malware infections has been motivating popular antiviruses (AVs) to develop dedicated detection strategies, which include meticulously crafted machine learning (ML) pipelines. However, malware dev…
▽ More
Malware is a major threat to computer systems and imposes many challenges to cyber security. Targeted threats, such as ransomware, cause millions of dollars in losses every year. The constant increase of malware infections has been motivating popular antiviruses (AVs) to develop dedicated detection strategies, which include meticulously crafted machine learning (ML) pipelines. However, malware developers unceasingly change their samples' features to bypass detection. This constant evolution of malware samples causes changes to the data distribution (i.e., concept drifts) that directly affect ML model detection rates, something not considered in the majority of the literature work. In this work, we evaluate the impact of concept drift on malware classifiers for two Android datasets: DREBIN (about 130K apps) and a subset of AndroZoo (about 285K apps). We used these datasets to train an Adaptive Random Forest (ARF) classifier, as well as a Stochastic Gradient Descent (SGD) classifier. We also ordered all datasets samples using their VirusTotal submission timestamp and then extracted features from their textual attributes using two algorithms (Word2Vec and TF-IDF). Then, we conducted experiments comparing both feature extractors, classifiers, as well as four drift detectors (DDM, EDDM, ADWIN, and KSWIN) to determine the best approach for real environments. Finally, we compare some possible approaches to mitigate concept drift and propose a novel data stream pipeline that updates both the classifier and the feature extractor. To do so, we conducted a longitudinal evaluation by (i) classifying malware samples collected over nine years (2009-2018), (ii) reviewing concept drift detection algorithms to attest its pervasiveness, (iii) comparing distinct ML approaches to mitigate the issue, and (iv) proposing an ML data stream pipeline that outperformed literature approaches.
△ Less
Submitted 15 August, 2022; v1 submitted 24 May, 2022;
originally announced May 2022.
-
Online Binary Models are Promising for Distinguishing Temporally Consistent Computer Usage Profiles
Authors:
Luiz Giovanini,
Fabrício Ceschin,
Mirela Silva,
Aokun Chen,
Ramchandra Kulkarni,
Sanjay Banda,
Madison Lysaght,
Heng Qiao,
Nikolaos Sapountzis,
Ruimin Sun,
Brandon Matthews,
Dapeng Oliver Wu,
André Grégio,
Daniela Oliveira
Abstract:
This paper investigates whether computer usage profiles comprised of process-, network-, mouse-, and keystroke-related events are unique and consistent over time in a naturalistic setting, discussing challenges and opportunities of using such profiles in applications of continuous authentication. We collected ecologically-valid computer usage profiles from 31 MS Windows 10 computer users over 8 we…
▽ More
This paper investigates whether computer usage profiles comprised of process-, network-, mouse-, and keystroke-related events are unique and consistent over time in a naturalistic setting, discussing challenges and opportunities of using such profiles in applications of continuous authentication. We collected ecologically-valid computer usage profiles from 31 MS Windows 10 computer users over 8 weeks and submitted this data to comprehensive machine learning analysis involving a diverse set of online and offline classifiers. We found that: (i) profiles were mostly consistent over the 8-week data collection period, with most (83.9%) repeating computer usage habits on a daily basis; (ii) computer usage profiling has the potential to uniquely characterize computer users (with a maximum F-score of 99.90%); (iii) network-related events were the most relevant features to accurately recognize profiles (95.69% of the top features distinguishing users were network-related); and (iv) binary models were the most well-suited for profile recognition, with better results achieved in the online setting compared to the offline setting (maximum F-score of 99.90% vs. 95.50%).
△ Less
Submitted 2 September, 2021; v1 submitted 20 May, 2021;
originally announced May 2021.
-
People Still Care About Facts: Twitter Users Engage More with Factual Discourse than Misinformation--A Comparison Between COVID and General Narratives on Twitter
Authors:
Mirela Silva,
Fabrício Ceschin,
Prakash Shrestha,
Christopher Brant,
Shlok Gilda,
Juliana Fernandes,
Catia S. Silva,
André Grégio,
Daniela Oliveira,
Luiz Giovanini
Abstract:
Misinformation entails the dissemination of falsehoods that leads to the slow fracturing of society via decreased trust in democratic processes, institutions, and science. The public has grown aware of the role of social media as a superspreader of untrustworthy information, where even pandemics have not been immune. In this paper, we focus on COVID-19 misinformation and examine a subset of 2.1M t…
▽ More
Misinformation entails the dissemination of falsehoods that leads to the slow fracturing of society via decreased trust in democratic processes, institutions, and science. The public has grown aware of the role of social media as a superspreader of untrustworthy information, where even pandemics have not been immune. In this paper, we focus on COVID-19 misinformation and examine a subset of 2.1M tweets to understand misinformation as a function of engagement, tweet content (COVID-19- vs. non-COVID-19-related), and veracity (misleading or factual). Using correlation analysis, we show the most relevant feature subsets among over 126 features that most heavily correlate with misinformation or facts. We found that (i) factual tweets, regardless of whether COVID-related, were more engaging than misinformation tweets; and (ii) features that most heavily correlated with engagement varied depending on the veracity and content of the tweet.
△ Less
Submitted 9 September, 2021; v1 submitted 3 December, 2020;
originally announced December 2020.
-
Machine Learning (In) Security: A Stream of Problems
Authors:
Fabrício Ceschin,
Marcus Botacin,
Albert Bifet,
Bernhard Pfahringer,
Luiz S. Oliveira,
Heitor Murilo Gomes,
André Grégio
Abstract:
Machine Learning (ML) has been widely applied to cybersecurity and is considered state-of-the-art for solving many of the open issues in that field. However, it is very difficult to evaluate how good the produced solutions are, since the challenges faced in security may not appear in other areas. One of these challenges is the concept drift, which increases the existing arms race between attackers…
▽ More
Machine Learning (ML) has been widely applied to cybersecurity and is considered state-of-the-art for solving many of the open issues in that field. However, it is very difficult to evaluate how good the produced solutions are, since the challenges faced in security may not appear in other areas. One of these challenges is the concept drift, which increases the existing arms race between attackers and defenders: malicious actors can always create novel threats to overcome the defense solutions, which may not consider them in some approaches. Due to this, it is essential to know how to properly build and evaluate an ML-based security solution. In this paper, we identify, detail, and discuss the main challenges in the correct application of ML techniques to cybersecurity data. We evaluate how concept drift, evolution, delayed labels, and adversarial ML impact the existing solutions. Moreover, we address how issues related to data collection affect the quality of the results presented in the security literature, showing that new strategies are needed to improve current solutions. Finally, we present how existing solutions may fail under certain circumstances, and propose mitigations to them, presenting a novel checklist to help the development of future ML solutions for cybersecurity.
△ Less
Submitted 4 September, 2023; v1 submitted 29 October, 2020;
originally announced October 2020.