
SPARSE: Semantic Tracking and Path Analysis for Attack Investigation in Real-time

Jie Ying, Tiantian Zhu*, Wenrui Cheng, Qixuan Yuan, Mingjun Ma, Chunlin Xiong, Tieming Chen, Mingqi Lv, and Yan Chen, IEEE Fellow
Abstract

As the complexity and destructiveness of Advanced Persistent Threats (APTs) increase, there is a growing need to identify the series of actions undertaken to achieve the attacker’s target, a task called attack investigation. Currently, analysts construct a provenance graph and perform causality analysis on a Point-Of-Interest (POI) event to capture critical events (those related to the attack). However, due to the vast size of the provenance graph and the rarity of critical events, existing attack investigation methods suffer from high false positives, high overhead, and high latency.

To this end, we propose SParse, an efficient and real-time system for constructing critical component graphs (i.e., graphs consisting of critical events) from streaming logs. Our key observations are that 1) critical events exist in a suspicious semantic graph (SSG) composed of interaction flows between suspicious entities, and 2) information flows that accomplish the attacker’s goal exist in the form of paths. Therefore, SParse uses a two-stage framework to implement attack investigation (i.e., constructing the SSG and performing path-level contextual analysis). First, SParse operates in a state-based mode where events are consumed as streams, allowing easy access to the SSG related to the POI event through a semantic transfer rule and a storage strategy. Then, SParse identifies all suspicious flow paths (SFPs) related to the POI event from the SSG and quantifies the influence of each path to filter irrelevant events. Our evaluation on a real large-scale attack dataset shows that SParse can generate a critical component graph ($\sim$113 edges) in 1.6 seconds, which is 2014$\times$ smaller than the backtracking graph ($\sim$227,589 edges). SParse is 25$\times$ more effective than other state-of-the-art techniques in filtering irrelevant edges.

Index Terms:
Advanced Persistent Threat, Intrusion/Anomaly Detection and Investigation, Data Provenance.

I Introduction

As the Internet has developed over time, APT attacks have grown more sophisticated and destructive. APT attacks target mainly large corporations such as Twitter [1], resulting in significant financial losses and reputational damage. In addition, APT attacks are executed in multiple stages, which include initial access, persistence, lateral movement, collection, and exfiltration [2].

While an intrusion may be noticed at any stage, detection only uncovers isolated traces of the attack. As a result, analysts must undertake causality analysis to capture the bigger picture and obtain a sound understanding of the detected attack point. Achieving a secure system recovery after a cyber attack requires certain key steps. First, analysts must determine how the adversary infiltrated the system. Once the point of entry is identified, then analysts need to assess the obvious and hidden damage done to the system, such as installed payload, modified files, and exfiltrated information. In short, analysts need to identify the sequence of critical events leading up to the POI event and reconstruct the critical component (subgraph consisting of critical events), which is also called attack investigation.

With the improvement of kernel-level monitoring frameworks [3, 4, 5], more and more causality analysis systems depend on a provenance graph consisting of entities (e.g., files, processes, and sockets) and inter-entity interactions (e.g., processes reading and writing files). However, the auditing framework is known to generate a large number of logs, up to several gigabytes per day on a single machine [6, 7], resulting in a massive graph with billions of edges. This leads to critical events that cause the attack to be drowned out by irrelevant events of normal behavior. Also, a provenance graph is a coarse-grained data format that cannot directly determine the specific dependencies between all relevant events of an entity (e.g., a process has multiple read-in events and write-out events) [8]. The dependency explosion problem [9, 10, 11] caused by these conditions leads to the poor performance of existing causality analysis systems.

TABLE I: Comparison table of related work on attack investigation performance. Column 5 (Storage of Historical Data) indicates whether to store the raw audit logs. The solidness of the marked circle reflects the degree: High (●), Medium (◑), Low (○).
Technique | System | False Positive Rate | False Negative Rate | Storage of Historical Data | Memory Overhead | Time Overhead
Label Propagation-based | HOLMES [12], RapSheet [13], APTSHIELD [14]
Anomaly Score-based | NODOZE [15], Swift [16], PRIOTRACKER [17], DEPIMPACT [18]
Machine Learning-based | ATLAS [19], DEPCOMM [20], WATSON [21]
/ | SParse

Methodologically, causality analysis can be classified into three categories: label propagation-based, anomaly score-based, and machine learning-based. Specifically, the label propagation-based approach [12, 22, 23, 14] sets entity labels and transformation rules through heuristic rules but suffers from a reliance on heavy manual effort and the incapacity to address zero-day vulnerabilities. The anomaly score-based approach [18, 15, 17, 16] quantifies the suspiciousness of dependency between entities, but faces challenges such as relying on historical statistics and the inability to adapt to complex enterprise production environments. The machine learning-based approach [19, 20, 21] employs neural networks to learn from attack samples but is hindered by insufficient sample size, poor generalization capability, and high computational overhead. These issues, as shown in Table I, make it challenging for analysts to conduct attack investigation within the optimal time (10 minutes) [24] while handling massive alerts. In summary, a general, efficient, and cost-effective causality analysis system needs to meet the following three requirements: 1) Reduced False Positives to address dependency explosion, 2) Affordable Overhead to reduce the cost of attack investigation, and 3) Minimal Latency to prevent further losses caused by subsequent attacks.

Key Insight. To meet the above requirements, after researching hundreds of APT attack descriptions [25] and analyzing numerous related dependency graphs [26, 27, 28, 17, 18, 19], we have the following two key insights. Firstly, critical events must exist in the suspicious semantic graph (SSG) formed by interactions between suspicious entities. Specifically, we construct the SSG consisting of suspicious entities (e.g., a process visiting an unknown website) and suspicious events (data and control flows initiated by the suspicious entities). We believe that the SSG contains all critical events (i.e., critical events are a subset of suspicious events) and is much smaller than the subgraph obtained by backtracking [29] from the POI event, as shown in Figure 1. Secondly, information flows that accomplish goals exist in the form of paths. In other words, we opine that evaluating whether to filter an event cannot be done in isolation, but rather calls for a comprehensive assessment of flow paths consisting of multiple events. Consequently, we construct all suspicious flow paths (SFPs) related to the POI event from SSG and quantify the degree of influence based on the properties of the POI event and path structural characteristics to weed out irrelevant events. In summary, we achieve the attack investigation through a two-stage step (i.e., SSG construction and path-level contextual analysis).

In summary, this paper proposes SParse (short for Semantic tracking and Path Analysis foR attack inveStigation in real-timE) and makes the following contributions:

  • We propose a state-based framework that contains suspicious semantic transfer rule and suspicious event storage strategy. The framework consumes events as streams in low overhead without recording historical data. In addition, the framework can output suspicious semantic graph related to the POI event in real time. The graph consists of all suspicious data flows and control flows that lead to the POI event, and thus contains all attack-related critical events. It is phase I of SParse for filtering semantic-irrelevant events.

  • We propose a path-level contextual analysis mechanism that incorporates suspicious flow path extraction and scoring. It utilizes an optimized BFS algorithm to extract all suspicious flow paths (SFPs) from the SSG. Then the mechanism combines the properties of the POI event and characteristics of the path structure to quantify the impact of each SFP on the POI event. Finally, it filters all events that only exist in SFPs with low scores. It is phase II of SParse for filtering impact-irrelevant events.

  • We implemented SParse and evaluated all its components in detail on a large-scale dataset with more than 150 million logs. Specifically, the dataset contains 10 simulated attacks [18] ($\sim$100 million logs) and 5 attacks from the DARPA TC program [30, 31] ($\sim$50 million logs). Experimental results show that SParse can generate the critical component graph ($\sim$113 edges) in 1.6s, which is 2014$\times$ smaller than the dependency graph ($\sim$227,589 edges). The critical component graph (FP = 99) generated by SParse is 25$\times$ more effective than other state-of-the-art causality analysis techniques (FP = 2,473) in filtering irrelevant edges while preserving the attack sequences. In addition, SParse can run for a long time while processing streaming logs with a low memory overhead (30MB).

II Background and Motivation

II-A Dependency Graph

Recent literature has leveraged the concept of data provenance, i.e., instead of manually piecing together individual evidence from raw logs, provenance-based systems can construct dependency graphs that explain the relationships between each event, simplifying the attack investigation. Specifically, a dependency graph $G(E, V)$ is a heterogeneous graph consisting of nodes $V$ representing system entities and edges $E$ representing inter-entity events. The attributes of entities and events are carefully selected from raw audit logs, which are lean and critical. For entities, we choose processes ($\langle ProcessName,\ ProcessID \rangle$), files ($\langle FileName \rangle$), and sockets ($\langle IP : Port \rangle$). For events, we made selections as shown in Table II. For any edge $e \in E$, there is $e = (u, v, t)$, where $u$ represents the subject, $v$ represents the object, and $t$ represents the timestamp of the event.
For two edges in the dependency graph, $e_1 = (u_1, v_1, t_1)$ and $e_2 = (u_2, v_2, t_2)$, we consider that there is a dependence (causality) between $e_1$ and $e_2$ if $v_1 = u_2$ and $t_1 < t_2$.
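The dependence condition above can be illustrated with a minimal Python sketch. The `Edge` type, the `depends_on` helper, and the sample entity names are hypothetical and chosen only for illustration; they are not part of SParse.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    """An event e = (u, v, t): subject u, object v, timestamp t."""
    u: str
    v: str
    t: float


def depends_on(e1: Edge, e2: Edge) -> bool:
    """Return True if e2 causally depends on e1, i.e. v1 == u2 and t1 < t2."""
    return e1.v == e2.u and e1.t < e2.t


# Hypothetical events: a process writes a file, then another process reads it.
# Following the data-flow direction, the file is the subject of the read.
write_c = Edge(u="process A", v="file C", t=3.0)
read_c = Edge(u="file C", v="process D", t=4.0)

print(depends_on(write_c, read_c))  # the read happens after the write
print(depends_on(read_c, write_c))  # the reverse direction does not hold
```

Chaining this pairwise test backward from a POI event is what produces the backtracking graph, which is why that graph grows so quickly on busy hosts.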

TABLE II: Attributes of system events
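Event | Operations | Attributes
Process Event | execve, clone | Time Stamp, Subject Name, Object Name, Data Amount
File Event | write, read, readv, writev | Time Stamp, Subject Name, Object Name, Data Amount
Network Event | sendto, recvfrom | Time Stamp, Subject Name, Object Name, Data Amount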
Figure 1: Partial dependency graph of one attack case, Dataleak. The black dashed box indicates the backtracking graph ($\sim$200,000 edges) constructed from the POI event via backward propagation. The blue dashed box indicates the suspicious semantic graph ($\sim$27 edges) constructed by SParse. The red dashed box indicates the critical component graph ($\sim$22 edges) exported by SParse.

II-B Attack Investigation

The goal of attack investigation using dependency graphs [12, 15, 13, 16, 20, 18, 14, 23, 32] is to identify all critical events and critical components related to the POI event. A critical component is a subgraph of the dependency graph that retains internal information critical to the attack investigation and eliminates irrelevant system activity. Typically this analysis includes tracing the flow of data through the graph to identify potentially relevant events, and examining the properties of nodes and edges to identify signs of compromise. The goal of an attack investigation is to determine both the source and scope of the attack, ascertain the extent of damage or disruption, and develop remediation and prevention strategies.

II-C Motivating Example

As shown in Figure 1, this is a typical data leakage attack. The attacker exploited a vulnerability in apache2 and downloaded the malicious artifact gather.sh. After executing the malware, the attacker collected sensitive data from the target host and saved it in the form of the file leaked.vm2. After using gpg to compress leaked.vm2 into the file leaked, the attacker transferred leaked to the C2 server 192.168.2.3:xx via process ssh.

In Figure 1, the black dashed box denotes the backtracking graph obtained by performing backward causality analysis [29], which includes all events causal-related to the alert. The blue dashed box denotes the suspicious semantic graph obtained by using suspicious semantic transfer, which includes all events suspicious semantic-related to the alert. The red dashed box denotes the critical component graph obtained by analyzing the path-level contextual semantics, which includes all events attack-related to the alert.

Obviously, the number of attack-related critical events ($\sim$22) is a drop in the ocean compared to the number of causal-related non-critical events ($\sim$200,000). This turns the attack investigation into a needle-in-a-haystack process, making it challenging for analysts to complete the investigation in the optimal time (600s) [24]. However, existing techniques such as DEPIMPACT [18] exhibit poor performance, as shown in Section V. DEPIMPACT requires 6,464s ($\gg$ 600s) to generate dependency graphs with 2,473 false positives on average. In addition, it needs to load raw audit logs, resulting in an endless memory overhead. Therefore, we need an attack investigation system with low false positives, low latency, and low overhead.

III Overview

III-A Threat Model

First, we assume that the event logs and digital signatures are credible, similar to previous work [33, 34, 27, 15, 29, 17, 13, 18]. In addition, we assume that no attack-related events occurred before log processing began.

Second, we assume that the attacker is external to the system and carries out the attack remotely. This may involve exploiting vulnerabilities within the system or employing social engineering tactics to convince a user to download and run a file containing malicious code. Therefore, we do not support side-channel attacks or insider attacks in which the attacker has a legitimate way to access the machine without resorting to such exploits or social engineering.

Third, we exclude mimicry attacks [35] from consideration in our threat model. These attacks are designed to evade intrusion detection systems by creating a seemingly benign chain of events within an enterprise environment. Existing intrusion detection systems [36, 37, 38] often rely on heuristics or analysis of individual events, making them vulnerable to such attacks. While detecting mimicry attacks is a limitation of current detection systems, it falls outside the scope of our work. Our focus is on identifying the events relevant to an alert generated by the detection system, which serve as contextual information for investigating the attack.

III-B Our Approach

In this section, we describe the architecture of SParse shown in Figure 2. Given a POI event, SParse can automatically identify the critical component of the dependency graph. SParse consists of two phases: (I) suspicious semantic graph construction (SSGC) and (II) path-level contextual analysis (PCA).

In Phase I, SParse makes use of mature auditing systems [4, 39, 40, 41, 42] to access kernel-level streaming logs and process them into specific data structures. Then SParse proposes a suspicious semantic transfer rule and storage strategy to maintain the suspicious entity list and related event table with low memory overhead. Given a POI event, SParse can construct the suspicious semantic graph (SSG) in real-time.

In Phase II, SParse first performs edge compaction on the suspicious semantic graph. Then SParse proposes a suspicious flow path extraction algorithm to identify possible propagation paths of the data/control flow in the suspicious semantic graph (i.e., suspicious flow paths). Next, SParse performs path-level contextual analysis, scores each suspicious flow path, and determines how relevant the path is to the POI event. Finally, SParse filters out all events that only exist in irrelevant paths from suspicious semantic graph to generate the critical component graph (CCG) as the output.

IV System Design

Figure 2: Architecture of SParse.

In this section, we describe the design details of each phase of SParse. As shown in Figure 2, SParse is a two-phase framework (i.e., constructing suspicious semantic graph and performing path-level contextual analysis) for mitigating the dependency explosion problem.

IV-A Goal and Key Insight

IV-A1 Suspicious Semantic Graph Construction

Goal. Given an alert point, current investigation techniques [15, 43, 17, 18, 28] store audit logs in memory (high overhead) and construct a backtracking graph [29] from the alert point. However, it usually includes numerous events that are impossible to result in an attack (high false positives), such as reads to read-only files, and interactions with benign processes. Additionally, it takes time to identify related events by going through these logs (high latency). In summary, the backtracking graph gives rise to the problems of high memory overhead, high time overhead, and high false positives in existing approaches. Therefore, we aim to construct a suspicious semantic graph with low memory overhead in real-time, which is smaller in size than the backtracking graph but contains all attack-related events.

Key Insight. To achieve the aforementioned goal, there are two key insights upon which we rely. (1) Suspicious semantics are introduced externally, i.e., attack is implemented remotely, as defined in Section III-A. (2) Suspicious semantics propagate between entities, i.e., suspicious entities transmit the suspicious semantics to non-suspicious entities via interaction.

Based on the above insights, we present a state-based framework to achieve the goal, which includes Section IV-B Streaming Log Monitoring and Section IV-C Suspicious Semantic Transfer.

IV-A2 Path-level Contextual Analysis

Goal. Once the POI event is identified, we construct the corresponding suspicious semantic graph. This suspicious semantic graph consists of all events semantically related to the POI event and contains all attack-related events (i.e., critical events). As shown in Section V, the SSG ($\sim$417 edges) is 545$\times$ smaller than the backtracking graph ($\sim$227,589 edges) but 3.7$\times$ larger than the critical component graph ($\sim$113 edges). This suggests that there are still many false positives in the suspicious semantic graph. Therefore, we aim to filter out the events that are contextually irrelevant to the POI event in the suspicious semantic graph by performing path-level contextual analysis. By mitigating the dependency explosion problem for the second time, we obtain a critical component graph to assist analysts in conducting attack investigation.
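As a quick consistency check, the reduction ratios follow directly from the average edge counts quoted in this paper. The symbols $|E_{\text{BG}}|$, $|E_{\text{SSG}}|$, and $|E_{\text{CCG}}|$ are shorthand introduced here for the edge counts of the backtracking graph, the suspicious semantic graph, and the critical component graph; the decimal values are our own rounding.

```latex
\[
\frac{|E_{\text{BG}}|}{|E_{\text{SSG}}|} = \frac{227{,}589}{417} \approx 545.8,
\qquad
\frac{|E_{\text{SSG}}|}{|E_{\text{CCG}}|} = \frac{417}{113} \approx 3.7,
\qquad
\frac{|E_{\text{BG}}|}{|E_{\text{CCG}}|} = \frac{227{,}589}{113} \approx 2014.
\]
```

Most of the overall reduction therefore comes from Phase I, with Phase II removing a further factor of roughly 3.7.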

Key Insight. To achieve path-level contextual analysis, we rely primarily on the following two key insights: (1) Only by evaluating data/control flow paths as a whole can we determine whether they have an impact on the POI event. In other words, we cannot determine whether an event has impacted a POI event in isolation (i.e., at the event level) [15, 18], but rather in context (i.e., at the path level). (2) Quantifying the degree of impact requires consideration of the properties and neighboring relationships of events.

Based on the above insights, we propose a path-level contextual analysis mechanism consisting of Section IV-D Edge Compaction, Section IV-E Suspicious Flow Path Extraction and Section IV-F Path-level Contextual Scoring.

IV-B Streaming Log Monitoring

SParse makes use of mature auditing systems [4, 39, 40, 41, 42] to access kernel-level logs and obtain the required data. At the entity level, SParse focuses on three entity types: file, process, and socket. To differentiate, SParse needs to construct unique identifiers for all entities. For the file, SParse records the absolute path as the unique identifier. For the process, SParse concatenates the PID and name as the unique identifier. For the socket, SParse constructs the 4-tuple (<srcip, srcport, dstip, dstport>) as the unique identifier. At the event level, SParse focuses on three event types: process interactions, file IO events, and network IO events. To the best of our knowledge, existing auditing systems are rich in semantics and meet the data requirements of SParse.
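The identifier construction described above could look like the following minimal Python sketch. The function names and the `pid:name` separator are assumptions made for illustration; the paper specifies only which fields make up each identifier.

```python
import posixpath
from typing import Tuple


def file_id(abs_path: str) -> str:
    """A file is identified by its absolute path."""
    if not posixpath.isabs(abs_path):
        raise ValueError(f"expected an absolute path, got {abs_path!r}")
    return abs_path


def process_id(pid: int, name: str) -> str:
    """A process is identified by its PID concatenated with its name.

    The ':' separator is an assumption; any unambiguous join would do.
    """
    return f"{pid}:{name}"


def socket_id(src_ip: str, src_port: int,
              dst_ip: str, dst_port: int) -> Tuple[str, int, str, int]:
    """A socket is identified by the 4-tuple <srcip, srcport, dstip, dstport>."""
    return (src_ip, src_port, dst_ip, dst_port)


# Hypothetical sample values.
print(file_id("/tmp/gather.sh"))
print(process_id(4321, "apache2"))
print(socket_id("10.0.0.5", 44321, "10.0.0.9", 22))
```

Including the name in the process identifier helps to keep entities distinct when a PID is reused over a long monitoring period.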

TABLE III: Suspicious Semantic Transfer Rule.
Event Type | Subject | Object | Description
Recvfrom | Socket | Process | A process receives data from the network; the process becomes suspicious.
Sendto | Process | Socket | A suspicious process sends data to the network.
Read | File | Process | A process reads a suspicious file; the process becomes suspicious.
Write | Process | File | A suspicious process writes a file; the file becomes suspicious.
Execve/Clone | Process | Process | A process is started by a suspicious process; the new process becomes suspicious.

IV-C Suspicious Semantic Transfer

The letter P in APT stands for persistence, which means that an attacker can lurk for a long time until the goal is achieved. To support real-time investigation and long-term monitoring, SParse utilizes a state-based structure and a suspicious semantic transfer rule to record state changes and associated events for each entity. We next describe the specific data structure and transfer rule in turn.

IV-C1 Data Structure

For any entity $v \in V$, SParse represents it as a triple $\langle U,\ T_y,\ S \rangle$. $U$ is the unique identifier of the entity; the construction of $U$ is described in Section IV-B. $T_y$ denotes the type of the entity and $S$ denotes the state of the entity. When $S$ is 0, the entity is not suspicious, and when $S$ is 1, the entity is suspicious. Note that a file or process has its $S$ initialized to 0 when it is created, and a socket has its $S$ initialized to 1 when it is created, i.e., by default we treat all sockets that are not in the whitelist as suspicious (Key Insight (1) in Section IV-A1).

For any event $e \in E$, SParse represents it as a quintuple $\langle U_s,\ U_o,\ O,\ T_i,\ D \rangle$. $U_s$ and $U_o$ are unique identifiers for the subject and object of $e$, respectively. $O$ denotes the type of $e$, $T_i$ denotes the time when $e$ occurred, and $D$ denotes the data flow amount of $e$. Note that SParse determines which entity is the subject and which is the object based on the direction of the data flow and control flow. For example, when $O = Read$, the data flow is from the file to the process, so the file is the subject and the process is the object. When $O = Write$, the data flow is from the process to the file, so the process is the subject and the file is the object.
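The entity triple and event quintuple can be sketched in Python as follows. This is an illustrative sketch only: the class and function names, the `INBOUND_OPS` set, and the whitelist parameter are our assumptions and do not reflect the authors' implementation.

```python
from dataclasses import dataclass
from enum import Enum


class EntityType(Enum):
    FILE = "file"
    PROCESS = "process"
    SOCKET = "socket"


@dataclass
class Entity:
    uid: str           # U: unique identifier
    etype: EntityType  # T_y: entity type
    state: int = 0     # S: 0 = not suspicious, 1 = suspicious


def new_entity(uid: str, etype: EntityType, whitelist=frozenset()) -> Entity:
    """Files and processes start with S = 0; non-whitelisted sockets start with S = 1."""
    suspicious = etype is EntityType.SOCKET and uid not in whitelist
    return Entity(uid, etype, 1 if suspicious else 0)


@dataclass(frozen=True)
class Event:
    subject: str  # U_s: source of the data/control flow
    obj: str      # U_o: destination of the data/control flow
    op: str       # O: event type
    time: float   # T_i: timestamp
    data: int     # D: data flow amount


# Operations whose data flows *into* the acting process (assumed set).
INBOUND_OPS = {"read", "readv", "recvfrom"}


def make_event(process_uid: str, other_uid: str, op: str,
               time: float, data: int) -> Event:
    """Orient an audit record by flow direction rather than by who issued the syscall."""
    if op in INBOUND_OPS:
        return Event(other_uid, process_uid, op, time, data)
    return Event(process_uid, other_uid, op, time, data)


read_ev = make_event("101:cat", "/tmp/leaked", "read", 1.0, 64)
write_ev = make_event("101:cat", "/tmp/out", "write", 2.0, 64)
print(read_ev.subject, "->", read_ev.obj)    # file is the subject of a read
print(write_ev.subject, "->", write_ev.obj)  # process is the subject of a write
```

Orienting every event by flow direction means suspicious semantics always propagate from subject to object, regardless of event type.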

IV-C2 Transfer Rule

Based on the idea of semantic transfer (Key Insight (2) in Section IV-A1), SParse constructs a set of predefined rules to process streaming logs and identify entity states in real-time. As shown in Table III, each rule is a quadruple $\langle O,\ T_s,\ T_o,\ D \rangle$. $O$ is the type of event, $T_s$ and $T_o$ are the entity types of the subject and object respectively, and $D$ is a description of the rule. From Table III we can see that the subject is able to transfer suspicious semantics to the object via a specific event, which is referred to as the "suspicious semantic transfer rule". As shown in Figure 3, $T$ denotes the moment, red entities denote suspicious entities, and red straight arrows denote suspicious semantic transfer. When $T = 3$, a suspicious process (process $A$) writes data to a file (file $C$), which in turn carries the suspicious semantic. When $T = 4$, the suspicious file is read by another process (process $D$), which then carries the suspicious semantic. Conversely, if an entity has no suspicious semantic, any event involving this entity as a subject will not propagate suspicious semantic. For example, when $T = 2$, a file (file $B$) is read by the suspicious entity (process $A$), but there is no propagation of the suspicious semantic.

As shown in lines 5 to 11 in Algorithm 1, SParse processes the streaming logs, analyzes data flows, and determines whether the entity state transitions. First, SParse accesses the event $e = \langle U_s,\ U_o,\ O,\ T_i,\ D \rangle$ and constructs the subject $u$ and object $v$ corresponding to that event. Then, SParse determines whether the subject $u$ is a socket or exists in the suspicious entity list (SEL, see Section IV-C3 for a detailed definition). Finally, as soon as one of these two conditions is met, SParse marks the object $v$ as suspicious and adds it to SEL.

Algorithm 1 Suspicious Semantic Graph Construction
Input:
1:  (1) Streaming logs in chronological order;
2:  (2) Suspicious Entity List (SEL);
3:  (3) Related Event Table (RET);
4:  (4) POI event $p$;
Output: Suspicious semantic graph for POI event $p$;
5:  for $e \in Streaming\ logs$ do
6:     Construct $u, v$ from $e$ where $u_U = e_{U_s}$, $v_U = e_{U_o}$
7:     if $u_S == 0$ and $u_U \notin SEL$ then
8:        continue;
9:     else
10:        $v_S = 1$;
11:        if $v_U \notin SEL$ then
12:           SEL.append($v_U$);
13:        end if
14:        Add $\{v_U : RET[u_U] + e\}$ to RET;
15:     end if
16:     if $p_{U_o} \in SEL$ then
17:        
18:        return graphConstruct($RET[p_{U_o}]$)
19:     end if
20:  end for
Algorithm 2 Suspicious Flow Path Extraction
Input:
1:  (1) Suspicious Semantic Graph $G$;
2:  (2) POI Event $p$;
3:  (3) $Q$ and $V$ for the queue structure, $T$ for the tree structure;
Output: Suspicious flow paths;
4:  $Q.add(p)$
5:  $T.createNode(p)$
6:  while $Q.num \neq 0$ do
7:     $e = Q.pop()$
8:     $V.add(e)$
9:     for $ie \in G.inEdges(e_{U_s})$ do
10:        if $ie_{T_i} < e_{T_i}$ and $ie \notin Q$ and $ie \notin V$ then
11:           $Q.add(ie)$
12:           $T.createNode(ie)$
13:           $T.createEdge(e,\ ie)$
14:        end if
15:     end for
16:  end while
17:  return $T.allPaths()$
Figure 3: An Example of Suspicious Semantic Transfer. The red solid line indicates that the entity carries suspicious semantics. SEL is short for Suspicious Entity List and RET is short for Relevant Event Table.

IV-C3 Storage Strategy

SParse designs two data structures to enable efficient storage of relevant data and real-time construction of the suspicious semantic graph. Specifically, SParse designs a Suspicious Entity List (SEL) and a Related Event Table (RET), as defined below.

Suspicious Entity List: A list that maintains all entities with suspicious semantics (possibly related to attacks). As shown in Figure 3, when $T=1$, the data flow passes from the suspicious socket to process $A$ (suspicious semantic transfer), so SParse adds entity $A$ to SEL. When $T=2$, there is no suspicious semantic transfer, so SEL is not changed. When $T=5$, the data flow passes from suspicious file $C$ to process $A$, but entity $A$ is already in SEL, so SEL is not changed.

Related Event Table: A table that holds the related events of every suspicious entity. The related events of a suspicious entity are the set of all data flows and control flows that lead to this entity’s semantic change. Specifically, SParse maintains a separate set of related events in RET for each suspicious entity. Whenever an event $e = \langle U_s, U_o, O, T_i, D \rangle$ that satisfies the suspicious semantic transfer rule occurs, SParse stitches the related events of $U_s$ with event $e$ and uses the result as the related events of $U_o$ to update RET. As shown in Section V-C, the RET is much smaller than the raw audit logs and the time needed to read it in is negligible, so SParse can construct the suspicious semantic graph related to the POI event in real time.

As shown in Figure 3, when $T=1$, SParse adds $A: \{1\}$ to RET, indicating that the related events of the suspicious entity $A$ are $\{1\}$. When $T=4$, SParse adds $D: \{1,3,4\}$ to RET, stitched from the related events $\{1,3\}$ of subject $C$ and the current event $\{4\}$. When $T=5$, SParse updates RET with $A: \{1,3,5\}$, stitched from the related events $\{1,3\}$ of subject $C$ and the current event $\{5\}$.

To speed up the consumption of log streams, SParse keeps the whole SEL in memory so that it can determine entity states and record suspicious entities in real time. In contrast, inspired by the CPU architecture, SParse keeps only the high-modification RETs (those that grow frequently within a short period) in memory and stores the low-modification RETs on the hard disk. According to our experimental results (see Section V-C for details), the memory overhead of SParse is about 30 MB on average, so high memory overhead is not a concern. Note that we treat sockets as suspicious entities by default, so only entities of file type and process type are saved in SEL.

In summary, SParse uses these two data structures to enable efficient storage of the necessary data and real-time construction of the suspicious semantic graph. As shown in lines 11 to 17 of Algorithm 1, SParse adds the object $v_U$ to the SEL for any event $e = \langle U_s, U_o, O, T_i, D \rangle$ that satisfies the semantic transfer rule. In addition, SParse stitches the related events of $U_s$ with event $e$ and uses the result as the related events of $U_o$ to update RET. Finally, for any given POI event, SParse is able to extract all related events for the object of that POI event from the RET in real time. SParse then uses a simple graph construction algorithm, which extracts entities from events as nodes and uses events as edges, to construct the suspicious semantic graph associated with the POI event.
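
The bookkeeping described above can be illustrated with a minimal Python sketch. It is not the implementation of SParse: the class `Storage`, the `Event` record, and the simplified transfer rule (semantics flow from any suspicious subject to the object) are assumptions made for illustration, as is the choice to merge the stitched set with the set already stored for the object. Event 2 is a hypothetical benign event, since the text only states that no transfer happens at $T=2$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    eid: int    # event identifier (here, the time step T of Figure 3)
    subj: str   # U_s: entity the flow comes from
    obj: str    # U_o: entity the flow goes to


class Storage:
    """Hypothetical in-memory stand-in for the SEL and the RET."""

    def __init__(self, sockets):
        self.sockets = set(sockets)  # sockets are suspicious by default
        self.sel = set()             # Suspicious Entity List (files, processes)
        self.ret = {}                # Related Event Table: entity -> event ids

    def is_suspicious(self, entity):
        return entity in self.sockets or entity in self.sel

    def consume(self, e):
        # Simplified transfer rule (assumption): suspicious semantics flow
        # from a suspicious subject to the object of the event.
        if not self.is_suspicious(e.subj):
            return
        if e.obj not in self.sockets:
            self.sel.add(e.obj)
        # Stitch the subject's related events with e. Merging with the set
        # already stored for the object is an assumption of this sketch.
        stitched = self.ret.get(e.subj, set()) | {e.eid}
        self.ret[e.obj] = self.ret.get(e.obj, set()) | stitched


store = Storage(sockets={"sock"})
stream = [
    Event(1, "sock", "A"),
    Event(2, "B", "E"),   # hypothetical benign event, no transfer
    Event(3, "A", "C"),
    Event(4, "C", "D"),
    Event(5, "C", "A"),
]
for ev in stream:
    store.consume(ev)
```

Under these assumptions, the final state matches the Figure 3 walkthrough: `A` maps to events 1, 3, and 5, and `D` maps to events 1, 3, and 4.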

IV-D Edge Compaction

A suspicious semantic graph often contains multiple parallel edges between two nodes. This is because operating systems typically complete read/write tasks (e.g., file read/write) by proportionally allocating data to multiple system calls. Inspired by recent work on graph reduction [44], SParse merges the edges between two nodes if the time difference between them is less than a given threshold. We chose a threshold of 10 seconds because it yields reasonable results across various system calls, such as file transfers and network connections.
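
A possible reading of this merging step is sketched below in Python. The `Edge` record and `compact_edges` are hypothetical names. The sketch assumes that the time difference is measured against the most recently absorbed parallel edge, that a merged edge keeps the earliest timestamp, and that byte counts are summed; the paper does not specify these details.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    src: str
    dst: str
    time: float   # seconds
    data: int     # bytes transferred


def compact_edges(edges, threshold=10.0):
    """Merge parallel edges whose timestamps are less than `threshold` apart."""
    merged = []
    open_edge = {}  # (src, dst) -> [merged edge, time of last absorbed edge]
    for e in sorted(edges, key=lambda x: x.time):
        key = (e.src, e.dst)
        slot = open_edge.get(key)
        if slot is not None and e.time - slot[1] < threshold:
            slot[0].data += e.data   # assumption: byte counts add up
            slot[1] = e.time
        else:
            kept = Edge(e.src, e.dst, e.time, e.data)
            merged.append(kept)
            open_edge[key] = [kept, e.time]
    return merged


result = compact_edges([
    Edge("fileX", "procA", 0, 100),
    Edge("fileX", "procA", 4, 200),
    Edge("procA", "fileY", 5, 70),
    Edge("fileX", "procA", 9, 50),
    Edge("fileX", "procA", 30, 10),
])
```

In this example the three reads within 10 seconds of each other collapse into one edge, while the read at 30 seconds stays separate.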

Figure 4: Suspicious flow path extraction and path-level contextual scoring.

IV-E Suspicious Flow Path Extraction

In order to perform path-level contextual analysis, it is first necessary to identify possible propagation paths of the data/control flow in the suspicious semantic graph (i.e., suspicious flow paths). SParse proposes a suspicious flow path extraction algorithm that can efficiently handle complex graph structures. In brief, as shown in Figure 4, SParse transforms the suspicious semantic graph into a multiway tree and then traverses it to obtain all suspicious flow paths.

Specifically, as shown in lines 1 to 5 of Algorithm 2, $Q$ and $V$ are the queue structures, where $Q$ holds the events to be traversed and $V$ holds the events that have been traversed. $T$ is the multiway tree structure, which holds the topological information. As shown in lines 6 to 9 of Algorithm 2, SParse traverses event $e$, identifying all incoming edges $ies$ ($ies = G.inEdges(e_{U_s})$). As shown in lines 10 to 13 of Algorithm 2, if an incoming edge $ie$ ($ie \in ies$) occurred earlier than event $e$ and has not been traversed ($ie \notin V$), SParse creates node $ie$ in the multiway tree $T$ with $e$ as its parent. Finally, SParse traverses the multiway tree $T$ to obtain all paths from the root node to the leaf nodes, which are output as suspicious flow paths.
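
As an illustration, Algorithm 2 can be sketched in Python as follows. The `Event` record, the function name `extract_paths`, and the example POI event are hypothetical. The sketch assumes that $Q$ is processed in FIFO order and uses a single `seen` set for the combined test $ie \notin Q$ and $ie \notin V$.

```python
from collections import deque, namedtuple

# Hypothetical event record: id, source entity U_s, target entity U_o, time T_i.
Event = namedtuple("Event", "eid subj obj time")


def extract_paths(events, poi):
    """Return all root-to-leaf paths (lists of event ids) of the backward tree."""
    in_edges = {}                       # entity -> events flowing into it
    for ev in events:
        in_edges.setdefault(ev.obj, []).append(ev)

    queue = deque([poi])                # Q: events still to be traversed
    seen = {poi.eid}                    # membership test for "in Q or in V"
    children = {poi.eid: []}            # T: parent event id -> child event ids

    while queue:
        e = queue.popleft()             # FIFO order is an assumption here
        for ie in in_edges.get(e.subj, []):
            if ie.time < e.time and ie.eid not in seen:
                seen.add(ie.eid)
                queue.append(ie)
                children[ie.eid] = []
                children[e.eid].append(ie.eid)

    # Enumerate every path from the root (the POI event) to a leaf.
    paths, stack = [], [[poi.eid]]
    while stack:
        path = stack.pop()
        kids = children[path[-1]]
        if not kids:
            paths.append(path)
        stack.extend(path + [k] for k in kids)
    return sorted(paths)


events = [
    Event(1, "sock", "A", 1),
    Event(3, "A", "C", 3),
    Event(4, "C", "D", 4),
    Event(5, "C", "A", 5),
]
poi = Event(6, "A", "F", 6)   # hypothetical POI: process A writes file F
paths = extract_paths(events, poi)
```

Because each event becomes a tree node at most once, event 1 appears only on the path where it was first reached, and event 4 is never visited because it does not flow into the POI event.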

The suspicious flow path extraction algorithm takes into account the timeliness and directionality of the data/control flow and is able to handle the complex graph structure efficiently, as demonstrated in Section V-C, where SParse extracts over 140 suspicious flow paths in one second on average. Finally, it is important to note that events exist as nodes in the multiway tree and suspicious flow paths, as shown in Figure 4.

IV-F Path-level Contextual Scoring

After extracting the suspicious flow paths, SParse needs to perform contextual analysis at the path-level to quantify the degree of influence of the entire path on the POI event (Key Insight (1) in Section IV-A2). Furthermore, the degree of impact between events is determined by the event attributes and the neighboring relationships between events (Key Insight (2) in Section IV-A2).

For each suspicious flow path $p$, SParse calculates the $PathScore$ using the following equation:

$$PathScore = \frac{\sum_{e \in E} EventScore(e)}{Len(p)} \tag{1}$$

where $e$ denotes an event and $E$ denotes the set of all events contained in the path ($e \in E$). $EventScore$ denotes the degree of impact of event $e$ on the parent node, as defined later. $Len(p)$ denotes the number of events in the path and is used to normalize the $PathScore$.

SParse calculates the $EventScore$ using the following equations:

$$EventScore = \alpha \frac{Impact(e, f)}{\sum_{s \in child(f)} Impact(s, f)},\quad f = parent(e) \tag{2}$$

$$\alpha = 1 + \frac{len(child(f) - 1)}{C} \tag{3}$$

where $parent(e)$ denotes the parent of event $e$, and $child(f)$ denotes all the children of event $f$ in the multiway tree. $Impact(e, f)$ denotes the degree of impact that event $e$ exerts on event $f$, as defined later. $\alpha$ is an inflation factor that mitigates the decrease in relative impact caused by branching (a parent node with multiple children). As shown in Equation 3, $\alpha$ is controlled by the hyperparameter $C$ and the number of child nodes. It is negatively correlated with $C$ and positively correlated with the number of child nodes.
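
To make Equations 2 and 3 concrete, the following Python sketch scores the children of a single parent $f$. The function names are hypothetical, and the numerator of Equation 3 is read here as the number of children of $f$ minus one, which is an assumption about the intended meaning. Returning zero when all impacts are zero is also a convention of this sketch.

```python
def inflation_factor(num_children, C=5):
    # Equation 3, read as 1 + (number of children - 1) / C (assumption).
    return 1 + (num_children - 1) / C


def event_scores(impacts, C=5):
    """impacts: child id -> Impact(child, f) for all children of one parent f."""
    total = sum(impacts.values())
    if total == 0:
        return {child: 0.0 for child in impacts}  # convention of this sketch
    alpha = inflation_factor(len(impacts), C)
    # Equation 2: relative impact among siblings, inflated by alpha.
    return {child: alpha * imp / total for child, imp in impacts.items()}


# Three siblings: alpha = 1 + 2/5 = 1.4 and the relative impacts are 0.6, 0.2, 0.2.
scores = event_scores({"e1": 0.9, "e2": 0.3, "e3": 0.3}, C=5)
single = event_scores({"only": 0.7}, C=5)
```

With an only child, $\alpha = 1$ and the score is 1 regardless of $C$; with three siblings, the inflation factor partly compensates for the relative impact being split three ways.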

SParse picks two features (i.e., data flow amount and time) to calculate the $Impact$ using the following equation:

$$Impact(e1, e2) = CS(Nor(e1_D, e1_{T_i}),\ Nor(e2_D, e2_{T_i})) \tag{4}$$

where $e1$ and $e2$ are two events and $e2$ is the parent node of $e1$ (i.e., $e2 = parent(e1)$). $e_D$ and $e_{T_i}$ denote the data flow amount and occurrence time of event $e$, respectively. $Nor(\cdot)$ denotes normalization, which removes the difference in scale between $e_D$ and $e_{T_i}$. $CS(\cdot)$ denotes the computation of cosine similarity.
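
A small Python sketch of Equation 4 is given below. The paper does not state which normalization $Nor(\cdot)$ uses, so min-max scaling over all events is assumed here purely for illustration. Returning 0 for a zero vector is likewise a convention of this sketch, and the feature values are hypothetical.

```python
import math


def normalize(events):
    """events: id -> (data amount D, time T_i). Returns id -> scaled (D, T_i)."""
    def scale(values):
        lo, hi = min(values), max(values)
        span = hi - lo
        return [(v - lo) / span if span else 0.0 for v in values]

    ids = list(events)
    d = scale([events[i][0] for i in ids])
    t = scale([events[i][1] for i in ids])
    return {i: (d[k], t[k]) for k, i in enumerate(ids)}


def cosine_similarity(u, v):
    norm = math.hypot(*u) * math.hypot(*v)
    if norm == 0:
        return 0.0   # convention for zero vectors (assumption)
    return (u[0] * v[0] + u[1] * v[1]) / norm


# Hypothetical feature values: (bytes, seconds).
features = {
    "a": (100, 0),
    "b": (526, 100),     # reads 526 bytes from the network
    "c": (526, 101),     # writes 526 bytes to a file one second later
    "d": (40000, 5000),  # large, much later transfer
}
vec = normalize(features)
near = cosine_similarity(vec["b"], vec["c"])
far = cosine_similarity(vec["d"], vec["c"])
```

Here `near` exceeds `far`, in line with the intuition that events with similar data amounts and timestamps are more likely to be causally related.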

Intuitively, we assume that if the data flow amount between parent node and child node is similar, then there is a causal relation between them (e.g., a process reads 526 bytes from the network and then immediately writes 526 bytes to a file, which may be the same content). Similarly, if the timestamps are similar, then there is a causal relation between the events since we think that the exploitation is automated and its steps quickly follow each other.

SParse iteratively calculates the scores of all suspicious flow paths and considers any path whose score is below a threshold $T$ to be an irrelevant path. Then, SParse filters out the events that exist only in irrelevant paths and outputs the retained part as a critical component graph to help analysts in attack investigation.
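
The filtering step can be sketched in Python as follows. The function names and the example scores are hypothetical; in particular, how the root (POI) event is scored is not specified in this section, so the sketch simply expects a score for every event on a path.

```python
def path_score(path, event_score):
    # Equation 1: mean EventScore over the events of the path.
    return sum(event_score[e] for e in path) / len(path)


def critical_events(paths, event_score, threshold=0.5):
    """Keep every event that lies on at least one path scoring >= threshold."""
    kept = set()
    for path in paths:
        if path_score(path, event_score) >= threshold:
            kept.update(path)
    return kept


# Hypothetical paths (lists of event ids) and per-event scores.
paths = [[6, 1], [6, 5, 3]]
event_score = {6: 0.5, 1: 0.2, 5: 0.8, 3: 0.6}
kept = critical_events(paths, event_score, threshold=0.5)
```

In this example, event 6 lies on both paths and is retained because one of them is relevant, whereas event 1 appears only on the irrelevant path and is filtered out.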

V Evaluation

In this section, we first present the evaluation preparation, including the characteristics of the dataset, the obtaining of ground truth, and the setting of evaluation metrics. We then evaluate the effectiveness and efficiency of each component separately. In summary, we aim to answer the following questions:

  • RQ1: How effective is SParse in attack investigation?

  • RQ2: How efficient is SParse in attack investigation?

  • RQ3: How sensitive is SParse in parameter selection?

V-A Evaluation Preparation

We deploy our implementation of SParse on a computer with Intel (R) Core (TM) i9-10900K CPU @ 3.70GHz and 64GB memory. SParse processes streaming logs from the auditing systems Sysdig [41] and SPADE [5], extracts information in the format as described in Section II-A, and runs continuously in a low-overhead state.

V-A1 Attack Dataset

We evaluate the effectiveness of SParse in revealing attack sequences on a dataset with over 150 million system audit logs. As shown in Table IV, this dataset contains 15 attack cases (10 simulated attacks and 5 DARPA attacks) and is provided by DEPIMPACT [18]. The simulated attacks consist of 7 single-host attacks (rows 2 to 8) based on common exploits [45, 28, 44, 26] and 3 multi-host attacks (rows 9 to 11) based on the Cyber Kill Chain [46] and CVE reports [45]. The simulated attacks were carried out on deployed hosts with 12 active users and hundreds of processes, where daily tasks such as file manipulation, text editing, and software development were performed to simulate real-world usage. We detail these 10 simulated attacks in Appendix 1-A1 and Appendix 1-A2. The DARPA dataset contains 5 host attacks (rows 12 to 16), which were conducted by two teams (FiveDirections and Theia) and differ in terms of target systems (Windows, Linux) and vulnerability exploits (pine backdoor, firefox backdoor, and browser extension).

Table IV shows the statistics of the generated dependency graphs for all attacks. Column “Attack” indicates the name of the attack case. Columns “# V” and “# E” indicate the number of nodes and edges of the backtracking graphs after performing causality analysis [29] from POI events. Column “# CE” shows the number of critical events (related to the attack), which we explain in detail below.

TABLE IV: The statistics of dependency graphs generated for all the 15 attacks.

| Attack | # V | # E | # CE |
| --- | --- | --- | --- |
| Wget Executable | 78 | 349 | 16 |
| Illegal Storage | 2,277 | 34,367 | 7 |
| Illegal Storage2 | 9,345 | 290,933 | 7 |
| Hide File | 23,110 | 459,514 | 10 |
| Steal Information | 23,153 | 495,570 | 7 |
| Backdoor Download | 1,411 | 12,354 | 12 |
| Annoying Server User | 114 | 585 | 15 |
| Shellshock | 1,706 | 42,918 | 36 |
| Dataleak | 1,863 | 20,807 | 25 |
| VPN Filter | 2,436 | 39,332 | 29 |
| Five Dir Case 1 | 259 | 473 | 8 |
| Five Dir Case 3 | 6,109 | 83,154 | 9 |
| Theia Case 1 | 175,196 | 794,341 | 8 |
| Theia Case 3 | 281,001 | 1,137,829 | 8 |
| Theia Case 5 | 245 | 1,309 | 5 |
| Avg | 35,220.20 | 227,589.00 | 13.47 |

TABLE V: Performance of dependency graphs generated by different techniques. SSGC and PCA are the components of SParse.

| Attack | SLEUTH FP | SLEUTH FN | SLEUTH # E | NODOZE FP | NODOZE FN | NODOZE # E | DEPIMPACT FP | DEPIMPACT FN | DEPIMPACT # E | SSGC FP | SSGC FN | SSGC # E | SSGC+PCA FP | SSGC+PCA FN | SSGC+PCA # E | SSGC+PCA # SFP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Wget Executable | 68 | 7 | 77 | 78 | 0 | 94 | 32 | 0 | 48 | 3 | 0 | 19 | 1 | 0 | 17 | 5 |
| Illegal Storage | 2189 | 5 | 2191 | 5686 | 1 | 5694 | 2625 | 0 | 2632 | 172 | 0 | 179 | 47 | 0 | 54 | 85 |
| Illegal Storage2 | 2072 | 3 | 2076 | 11959 | 1 | 11967 | 1255 | 0 | 1262 | 986 | 0 | 993 | 252 | 0 | 259 | 522 |
| Hide File | 4558 | 5 | 178703 | 38919 | 2 | 38931 | 14982 | 0 | 14992 | 1590 | 0 | 1600 | 356 | 0 | 366 | 833 |
| Steal Information | 3972 | 3 | 179202 | 21114 | 1 | 21122 | 15774 | 0 | 15781 | 1654 | 0 | 1661 | 453 | 0 | 460 | 868 |
| Backdoor Download | 1392 | 6 | 9198 | 232 | 1 | 245 | 2625 | 0 | 2637 | 42 | 0 | 54 | 36 | 0 | 48 | 20 |
| Annoying Server User | 2 | 13 | 281 | 88 | 2 | 105 | 39 | 0 | 54 | 3 | 0 | 18 | 1 | 0 | 16 | 6 |
| Shellshock | 672 | 9 | 4299 | 1007 | 4 | 1047 | 161 | 4 | 197 | 87 | 0 | 123 | 9 | 0 | 45 | 40 |
| Dataleak | 622 | 7 | 4279 | 597 | 7 | 629 | 44 | 3 | 69 | 16 | 0 | 47 | 68 | 0 | 83 | 17 |
| VPN Filter | 722 | 8 | 5342 | 310 | 5 | 344 | 189 | 2 | 218 | 86 | 0 | 115 | 72 | 0 | 101 | 37 |
| Five Dir Case 1 | 97 | 4 | 237 | 291 | 1 | 300 | 17 | 0 | 25 | 8 | 0 | 16 | 3 | 0 | 11 | 11 |
| Five Dir Case 3 | 277 | 5 | 37058 | 717 | 1 | 727 | 171 | 0 | 180 | 82 | 0 | 91 | 31 | 0 | 40 | 22 |
| Theia Case 1 | 453 | 6 | 264107 | 92091 | 2 | 92101 | 689 | 1 | 697 | 494 | 0 | 502 | 98 | 0 | 106 | 374 |
| Theia Case 3 | 231 | 4 | 493626 | 7369 | 1 | 7378 | 1074 | 1 | 1082 | 820 | 0 | 828 | 61 | 0 | 129 | 598 |
| Theia Case 5 | 2 | 1 | 845 | 395 | 1 | 401 | 3 | 0 | 8 | 9 | 0 | 14 | 2 | 0 | 7 | 15 |
| Avg FP/FN/# E | 1154 | 5.73 | 1163 | 16682 | 2.00 | 16696 | 2473 | 0.73 | 2487 | 403 | 0.00 | 417 | 99 | 0.00 | 113 | 230 |
| Avg FPR/FNR ($10^{-2}$) | 0.507 | 42.54 | / | 7.329 | 14.85 | / | 1.087 | 5.419 | / | 0.177 | 0 | / | 0.043 | 0 | / | / |

  • † SSGC: Suspicious Semantic Graph Construction. PCA: Path-level Contextual Analysis. SFP: Suspicious Flow Path.

V-A2 Obtaining Ground Truth

In order to evaluate the performance of SParse, we need to specify the ground truth for all attack cases (i.e., identify all critical events). Specifically, we analyzed the targets of each attack case and determined the corresponding POI events from massive logs. We then conducted back-propagation [29] from POI events to obtain backtracking graphs and searched for critical events within them. Finally, we manually ascertained the critical events based on Indicators of Compromise (e.g., file names and malware names) and attack steps (e.g., download then execution), as shown in Appendix 1-A.

Evaluation Metrics. First, we measure false positives (FP) and false negatives (FN). False positives refer to those edges that SParse identifies as critical but are not, while false negatives refer to those edges that SParse identifies as irrelevant but are critical. Then we compute the false positive rate $FPR = FP / E_{total}$ and false negative rate $FNR = FN / E_c$, where $E_{total}$ represents the number of edges and $E_c$ represents the number of critical edges, respectively.
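
As a quick illustration of these metrics, the Python sketch below computes FP, FN, FPR, and FNR from sets of edge identifiers. The function name and the synthetic edge ids are hypothetical. $E_{total}$ is taken to be the number of edges in the backtracking graph, which is an assumption. The counts mirror the “Wget Executable” row of Tables IV and V (349 edges, 16 critical edges, 17 reported edges).

```python
def investigation_metrics(reported, critical, total_edges):
    """Return (FP, FN, FPR, FNR) for sets of reported and critical edge ids."""
    fp = len(reported - critical)   # reported as critical but not critical
    fn = len(critical - reported)   # critical but not reported
    return fp, fn, fp / total_edges, fn / len(critical)


critical = set(range(16))           # 16 critical edges (synthetic ids)
reported = set(range(17))           # 17 reported edges, one of them spurious
fp, fn, fpr, fnr = investigation_metrics(reported, critical, total_edges=349)
```

Under these assumptions, the example yields one false positive, no false negatives, and an FPR of 1/349.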

V-B RQ1 : How effective is SParse in attack investigation ?

There has been a lot of graph-based related work on attack investigation [12, 47, 48, 21, 44, 18, 15, 13]. However, HOLMES [12] and RapSheet [13] rely solely on defined TTP-like (Tactics, Techniques, and Procedures) rules for detection and investigation. Such approaches suffer from heavy reliance on manual efforts and cannot effectively address zero-day vulnerabilities. HERCULE [47] and WATSON [21] address attack investigation problems by discovering communities (i.e., behavioral abstractions) on the provenance graph. Their purpose is to assist analysts in identifying attack stages from a community perspective, enabling a quick understanding of the purpose of a subgraph in the provenance graph (e.g., file compilation and uploading). HERCULE and WATSON (subgraph-level) differ in granularity from SParse (event-level). Hence, we do not compare our work with them.

Here, we compare the performance of SParse with 3 state-of-the-art approaches: SLEUTH [48], NODOZE [15], and DEPIMPACT [18], which are more relevant in terms of methodology (anomaly-score based) and granularity (event-level) for our evaluation. SLEUTH defines TTP-like rules with the added constraint that these rules only fire when certain confidentiality or integrity conditions are satisfied according to a tag-based information flow propagation. NODOZE measures the rarity of different events in the environment and, based on this, assigns anomaly scores to each event in the dependency graph. We use logs that only contain normal behavior (captured outside of attack periods) as execution profiles (i.e., statistics of events) to satisfy NODOZE. DEPIMPACT assigns anomaly scores to edges based on a number of characteristics (including time, data flow amount, and node access), and then aggregates the scores to determine the entry points through a propagation algorithm. DEPIMPACT then outputs the events shared by the forward graph of the entry point and the backward graph of the alert point. Finally, we also perform an ablation experiment on SParse to evaluate the output of different phases.

Table V shows the performance of attack investigation for different techniques in all cases. Lower FP/FPR indicates a better ability to filter irrelevant edges, and lower FN/FNR indicates a better ability to retain critical edges. The results show that SParse (SSGC + PCA) performs the best. On average, the critical component graph generated by SParse ($\sim$113 edges) is 8849$\times$ smaller than the original dependency graph ($\sim$1,000,000 edges) and 22$\times$ smaller than the second-best result (i.e., DEPIMPACT with $\sim$2,487 edges). SParse demonstrates the best capability in filtering irrelevant edges while preserving the attack sequences (FP = 99, FPR = $0.043 \times 10^{-2}$), which is 25$\times$ more effective than DEPIMPACT (FP = 2,473, FPR = $1.087 \times 10^{-2}$). Moreover, SParse does not miss any critical edges (i.e., FN = 0, FNR = 0). Note that Column “SSGC” and Column “# SFP” denote the suspicious semantic graph construction and the suspicious flow paths, respectively, both of which are intermediate products of SParse.

SLEUTH investigates attack scenarios by defining confidentiality and integrity labels for label propagation. However, SLEUTH cannot guarantee coverage of all attack-related edges and therefore performs the worst in FNR (FN = 5.73, FNR = $42.54 \times 10^{-2}$), resulting in ineffective support for attack investigation. NODOZE performs better than SLEUTH in including critical edges but worse in FPR (FP = 16,682, FPR = $7.329 \times 10^{-2}$; FN = 2, FNR = $14.85 \times 10^{-2}$), so it does not effectively reduce the investigation cost for analysts. The reason is that the performance of NODOZE relies on whether the execution profile comprehensively covers all benign behaviors but ignores information in the form of streams. DEPIMPACT heuristically selects 3 entry points (one each for files, processes, and sockets), which amplifies the attack surface. DEPIMPACT directly takes the intersection between the forward graph of the entry points and the backward graph of the alert point as output, which introduces massive attack-irrelevant events (FP = 2,473, FPR = $1.087 \times 10^{-2}$). Regarding the reduction results, SParse exhibits the best performance (FP = 99, FPR = $0.043 \times 10^{-2}$). We attribute the good performance of our system to two factors: (1) SParse employs the insight of semantic transfer to construct a suspicious semantic graph related to the POI event, which inherently filters out a significant number of irrelevant events (FP = 403, FPR = $0.177 \times 10^{-2}$), and (2) SParse evaluates the relevance of context to POI events at the path level rather than in isolation.

In summary, we believe that SParse can satisfy the requirement of low false positives in practice.

TABLE VI: Comparison of data consumption rate and data generation rate.

| Attack Case | Generation Rate (events per second) | Consumption Rate (events per second) |
| --- | --- | --- |
| Wget Executable | 3,018 | 42,175 |
| Illegal Storage | 3,273 | 45,572 |
| Illegal Storage2 | 3,156 | 34,727 |
| Hide File | 4,039 | 31,155 |
| Steal Information | 4,271 | 40,719 |
| Backdoor Download | 3,380 | 39,451 |
| Annoying Server User | 3,153 | 42,155 |
| Shellshock | 3,741 | 42,742 |
| Dataleak | 3,692 | 42,899 |
| VPN Filter | 3,560 | 39,611 |
| Five Dir Case 1 | 917 | 40,358 |
| Five Dir Case 3 | 1,161 | 44,795 |
| Theia Case 1 | 1,298 | 32,097 |
| Theia Case 3 | 1,215 | 34,856 |
| Theia Case 5 | 966 | 41,303 |
| Avg | 2,722 | 39,641 |

TABLE VII: Overhead performance of each component of SParse and baseline approach.

| Attack | SSGC Mem. (MB) | SSGC CPU (%) | SSGC Disk (MB) | PCA Time (s) | PCA Mem. (MB) | PCA CPU (%) | PCA # SFP | DEPIMPACT Time (s) | DEPIMPACT Mem. (MB) | DEPIMPACT CPU (%) | DEPIMPACT Disk (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Wget Executable | 29.75 | 3.14 | 0.072 | 0.09 | 108.20 | 4.97 | 5 | 4.45 | 125.72 | 7.62 | 0.79 |
| Illegal Storage | 29.92 | 3.32 | 7.15 | 0.32 | 124.97 | 4.36 | 85 | 285.10 | 178.89 | 7.46 | 79.21 |
| Illegal Storage2 | 31.81 | 3.35 | 69.21 | 2.38 | 125.16 | 4.27 | 522 | 985.55 | 304.51 | 7.29 | 169.21 |
| Hide File | 34.82 | 3.59 | 101.12 | 6.22 | 147.34 | 4.94 | 833 | 19,814.25 | 592.82 | 7.51 | 401.12 |
| Steal Information | 36.55 | 3.61 | 109.17 | 6.66 | 155.60 | 4.90 | 868 | 12,701.39 | 596.26 | 7.82 | 409.17 |
| Backdoor Download | 28.01 | 3.27 | 2.21 | 0.10 | 125.16 | 4.49 | 20 | 33.68 | 199.17 | 7.42 | 61.13 |
| Annoying Server User | 26.62 | 3.40 | 0.011 | 0.33 | 119.41 | 4.41 | 6 | 0.23 | 127.81 | 7.76 | 0.70 |
| Shellshock | 29.74 | 3.39 | 8.94 | 0.02 | 137.75 | 4.13 | 40 | 2.17 | 138.98 | 7.57 | 154.00 |
| Dataleak | 30.39 | 3.40 | 0.82 | 0.04 | 132.86 | 4.29 | 17 | 2.24 | 130.65 | 7.22 | 97.50 |
| VPN Filter | 31.01 | 3.26 | 0.82 | 0.06 | 146.08 | 4.27 | 37 | 15.46 | 135.49 | 7.43 | 537.95 |
| Five Dir Case 1 | 28.83 | 3.29 | 7.15 | 0.04 | 124.28 | 4.56 | 11 | 2.08 | 124.12 | 7.67 | 0.82 |
| Five Dir Case 3 | 30.05 | 3.50 | 37.71 | 0.11 | 135.04 | 4.29 | 22 | 1,674.90 | 189.08 | 7.72 | 88.65 |
| Theia Case 1 | 31.97 | 3.43 | 59.50 | 2.34 | 137.35 | 4.15 | 374 | 24,761.90 | 844.28 | 7.64 | 479.20 |
| Theia Case 3 | 35.08 | 3.62 | 68.45 | 4.85 | 143.90 | 4.80 | 598 | 36,682.87 | 885.83 | 7.79 | 592.23 |
| Theia Case 5 | 29.12 | 3.30 | 0.13 | 0.19 | 129.21 | 4.67 | 15 | 4.91 | 138.60 | 7.15 | 1.41 |
| Avg | 30.91 | 3.39 | 21.03 | 1.58 | 132.82 | 4.50 | 230 | 6,464.75 | 320.81 | 7.54 | 204.87 |

  • † SSGC: Suspicious Semantic Graph Construction. PCA: Path-level Contextual Analysis. SFP: Suspicious Flow Path.

V-C RQ2 : How efficient is SParse in attack investigation ?

In this section, we evaluate the efficiency of SParse when deployed in a real scenario. First, we evaluate the real-time performance of SParse by comparing the data generation rate with the data consumption rate. Table VI shows that, on average, SParse consumes events 15$\times$ faster than the host generates them, which shows that real-time operation of SParse is feasible. Then we evaluate the overheads of each component of SParse on time, memory, CPU, and disk, as shown in Table VII. In addition, we conduct comparative experiments with the baseline method DEPIMPACT.

Time and Memory. SParse can perform path-level contextual analysis of the suspicious semantic graph and construct the critical component graph in 2s on average, which is 4091$\times$ faster than DEPIMPACT ($\sim$6,464s). DEPIMPACT uses an impact propagation algorithm to identify entry points. However, when the number of edges in the backtracking graph is high (e.g., case “Hide File”), it takes an extremely long time to reach global convergence (19,814s). In terms of memory overhead, three factors make the memory overhead of SParse smaller than that of DEPIMPACT: (1) SParse only stores suspicious nodes in memory when processing streaming logs, so the memory overhead grows extremely slowly and can be regarded as constant (i.e., 30MB). (2) The suspicious semantic graphs ($\sim$403 edges) read in by SParse are much smaller than the dependency graphs ($\sim$one million edges) read in by DEPIMPACT. (3) The suspicious flow path extraction algorithm applied in PCA only traverses the graph structure once (a complexity of $O(E)$) and generates a smaller number of SFPs (230 on average). For these reasons, the memory overhead of PCA (132.82 MB) is 2.4$\times$ smaller than that of DEPIMPACT (320.81 MB). Finally, it is important to emphasize that SParse performs suspicious semantic graph construction (SSGC) in real time, so we do not perform a time overhead evaluation for this component.

CPU and Disk. SParse stores the relevant event table (RET) in the database when processing streaming logs. To evaluate the overhead of SParse on the hard disk, we also perform the relevant experiments. As shown in Table VII, SParse requires only 21.03 MB of disk space, which is 9$\times$ smaller than the original logs (204.87 MB). This is because SParse only retains events of suspicious semantic relevance, whereas other techniques [10, 15, 18, 28, 17] require the whole logs, which significantly increases overhead on the disk. In addition, the CPU overhead of SParse is only 4.5%, owing to its simple yet effective algorithm.

In summary, we believe that SParse outperforms the latest work DEPIMPACT in all aspects of overhead and can satisfy the requirements of low overhead and latency in application scenarios.

V-D RQ3 : How sensitive is SParse in parameter selection?

Figure 5: Hyperparameter Matrices for System Performance with Different Parameters. (a) Hyperparameter matrix on $FP$. (b) Hyperparameter matrix on $FN$.

As shown in Section IV-F, there are two hyperparameters in SParse that need to be set: (1) $C$, which affects the inflation factor, and (2) $T$, which filters the paths. As shown in Equation 3, the larger $C$ is, the smaller the inflation factor $\alpha$, the lower the path score, and the fewer events will be retained by SParse. Parameter $T$, on the other hand, indicates how strict SParse is in path selection; the larger $T$ is, the fewer events will be retained by SParse.

We demonstrate the sensitivity of SParse to parameter selection by testing all combinations of the hyperparameters $C$ and $T$ through grid search. Specifically, we vary $C$ from 1 to 9 with a step size of 1 and $T$ from 0.1 to 0.9 with a step size of 0.1, and measure the FP and FN of SParse for each combination, as shown in Figure 5. As $C$ increases, SParse reduces the path score and retains fewer events, hence FP decreases. At the same time, SParse misses some critical events, causing FN to rise. As the threshold $T$ increases, SParse blocks more paths and preserves fewer events, so FP falls while FN rises. As shown in Figure 5(a), FP decreases gradually from the upper left to the lower right; as shown in Figure 5(b), FN increases gradually from the upper left to the lower right. The effects of these parameter changes on SParse are in line with our predictions.
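The grid search described above can be sketched as follows. Only the grid itself ($C$ from 1 to 9, $T$ from 0.1 to 0.9) comes from the text. The evaluator `toy_evaluate` and the data in `TOY_PATHS` are hypothetical stand-ins: the toy score `base / C` merely mimics the stated trend that a larger $C$ lowers the path score, and it is not Equation 3.

```python
from typing import Callable, List, Tuple

# Hypothetical toy data: (base path score, whether the path is attack-related).
TOY_PATHS = [(0.95, True), (0.80, True), (0.60, True),
             (0.70, False), (0.40, False), (0.20, False)]


def toy_evaluate(c: int, t: float) -> Tuple[int, int]:
    """Return (FP, FN) for one (C, T) pair on the toy paths.

    Assumption: a path is retained when base / C >= T. This is a
    placeholder for the real path score, not the paper's formula.
    """
    fp = fn = 0
    for base, critical in TOY_PATHS:
        retained = (base / c) >= t
        if retained and not critical:
            fp += 1          # irrelevant path kept
        elif not retained and critical:
            fn += 1          # attack-related path dropped
    return fp, fn


def grid_search(evaluate: Callable[[int, float], Tuple[int, int]]):
    """Evaluate every (C, T) combination and return the FP and FN matrices."""
    c_values = list(range(1, 10))                          # 1, 2, ..., 9
    t_values = [round(0.1 * i, 1) for i in range(1, 10)]   # 0.1, 0.2, ..., 0.9
    fp_matrix: List[List[int]] = []
    fn_matrix: List[List[int]] = []
    for c in c_values:
        results = [evaluate(c, t) for t in t_values]
        fp_matrix.append([fp for fp, _ in results])
        fn_matrix.append([fn for _, fn in results])
    return c_values, t_values, fp_matrix, fn_matrix


c_values, t_values, fp_matrix, fn_matrix = grid_search(toy_evaluate)
for c, fp_row, fn_row in zip(c_values, fp_matrix, fn_matrix):
    print(f"C={c}  FP={fp_row}  FN={fn_row}")
```

In a real evaluation, `toy_evaluate` would be replaced by a function that runs the investigation with the given parameters and compares the result against the ground truth.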

Obviously, we want both FP and FN to be as low as possible, but there is a trade-off between them (i.e., a rise in one leads to a fall in the other). Finally, we choose $C = 5$ and $T = 0.5$ as the default parameters for SParse. Of course, these parameters can be adapted to the specific scenario.

VI Discussion

Cooperation with existing techniques. There is a requirement for defenders to be able to detect and handle real-world attacks in real time. As an attack investigation system, SParse can be combined with a variety of existing techniques to meet this goal. By working with intrusion detection systems [12, 14, 23, 49] that can provide real-time alerts and defenses, SParse is able to investigate alerts for relevant events and provide a brief critical component graph to analysts. By working with compression systems [44, 10, 50, 51, 32] that can reduce redundant information, SParse is able to reduce memory and disk overheads, enabling years of relevant data storage. By working with analysis systems [13, 43, 15] that automatically determine the authenticity of alarms, SParse is able to provide streamlined but sufficient relevant events to support the triage of alarms.

Evasion Attacks. Existing investigation techniques, such as DEPIMPACT, utilize weight computation and score propagation to identify attack entry points. However, calculating the weight of each event independently does not fit the situation where information flows between entities. As a result, an attacker can inject a payload by writing multiple times and thus evade tracking. In contrast, SParse mitigates this problem by performing contextual analysis at the path level to synthesize the relevance between a path and an alert. An attacker may also try to evade investigation by going the long way around (i.e., repeating nonsensical behavior), as in the attack case “Hide File”, where the attacker changes the file name multiple times. As shown in Section IV-F, SParse is able to mitigate this problem through its relative score calculation and event score inflation mechanisms.

Limitation. To implement attack investigation, SParse relies on alerts raised by the Endpoint Detection and Response (EDR) system placed on the host. SParse cannot perform attack investigation if the detection system fails to raise alerts (i.e., fails to identify suspicious behavior). Recent approaches [52, 53] propose solutions to improve the detection of anomalous system activity, and SParse can work with these approaches to provide better defenses. If the detection system frequently raises false positives, SParse can only identify the relevant events but cannot filter out these false alerts. However, SParse can work with alarm triage techniques [13, 15, 50] and help them filter false alarms by providing a streamlined critical component graph. In addition, as shown in Section IV-C3, SParse needs to maintain a suspicious entity list in memory and a related event table on disk. As the runtime grows, redundant information accumulates in the related event table, such as $Event\ 10$ being stored four times in Figure 3. By working with existing compression systems [44, 10, 50, 51, 32], SParse can effectively mitigate this situation and support long deployment runs.

VII Related Work

Analysts need to perform threat alert validation and post-mortem analysis of incidents. While auditing is by no means the only form of forensic investigation, it is telling that 75% of incident response specialists consider logs to be the most valuable form of investigation artifact [54]. Several studies have focused on efficient log collection, such as kellect [55] for Windows and eBPF [56, 57] for Linux, which is a prerequisite for attack investigation. Methodologies for attack investigation using logs can be divided into three categories: label propagation-based, anomaly score-based, and machine learning-based.

Label Propagation-based. Labels are given to nodes and are propagated to other nodes by system calls. When an alert arises, the analyst can easily retrace the events associated with the alert based on the label. Milajerdi et al. propose HOLMES [12], which mitigates the dependency explosion problem by requiring the aggregation of more labels to raise the detection threshold. RapSheet [13] makes use of the tactical provenance graph (TPG), which instead of encoding low-level system event dependencies, reasons about the causal relationships between threat alerts. RapSheet proposes a threat scoring scheme that evaluates the severity of each alert based on TPGs to enable effective investigation of alerts. Zhong et al. [58], on the other hand, mine the analysts’ security operation traces to learn label-propagation rules, and then use these rules to identify relevant paths automatically when an alert occurs. CONAN [23] iteratively performs malicious behavior determination with label passing and aggregation through data provided by ETW, enabling real-time detection and investigation. These label-based approaches rely on heuristic rules that cannot handle all types of attacks and have a high level of false negatives in attack investigation.

Anomaly Score-based. The core idea of these methods is to quantify the suspiciousness of edges between pairs of nodes. Pei et al.’s HERCULE [47] system correlates multi-source heterogeneous logs to construct a multi-dimensional weighted graph and uses the unsupervised community detection algorithm Louvain [59] to discover attack-related paths from it. NODOZE [15] and PRIOTRACKER [17], on the other hand, compute statistics on historical data and assign anomaly scores to events in the dependency graph. A score propagation algorithm is then used to find suspicious events. Both of these methods rely on statistics of historical data and cannot be applied in complex and variable production environments. DEPIMPACT [18] calculates dependency weights globally based on multiple features (including time, data traffic amount, and node access) and then aggregates the weights to nodes to determine suspicious points of intrusion. The overlapping parts between the forward graph of the entry point and the backward graph of the alarm point are then considered attack-related events. These score-based methods can cover all critical events. However, they place no restriction on the variation of scores to address the dependency explosion problem, which leads to higher false positives. In addition, these methods require reading in all the logs to build the dependency graph, which is very expensive in terms of hard disk and memory.

Machine Learning-based. Some techniques use machine learning methods to learn contextual and structural information from dependency graphs in order to identify the abnormal events most relevant to an alert. ATLAS [19] uses a novel combination of causal analysis, natural language processing, and machine learning to construct sequence-based models that capture critical patterns of attack and non-attack behavior in the dependency graph. DEPCOMM [20], on the other hand, proposes a novel graph summarization method that divides the large graph into process-centric subgraphs. DEPCOMM then extracts summaries from each subgraph, enabling the generation of summary graphs from dependency graphs and thereby reducing the difficulty of investigation for analysts. WATSON [21] automatically abstracts and clusters high-level system behavioral features from low-level audit events. WATSON performs a Depth-First Search on each object to summarize system behavior and then uses machine learning to infer the semantics of each audit event based on its context. Finally, behaviors with similar semantics are aggregated in the embedding space to identify events similar to the alert in order to investigate the attack. These machine learning-based approaches suffer from inadequate training samples, poor generalization capabilities, and high computational costs.

In contrast to previous work, SParse employs a hybrid method in the specific domain of causality tracking. SParse first uses the suspicious semantic transfer rule to construct the suspicious semantic graph. SParse then uses path-level contextual analysis to extract a streamlined critical component graph.
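As a rough illustration of the second stage, the sketch below enumerates backward paths that end at the alerted entity and keeps only the edges on paths whose score reaches a threshold. It is a naive enumeration under stated assumptions, not the suspicious flow path extraction algorithm of SParse: it ignores event timestamps, it does not achieve the single-pass $O(E)$ traversal described in Section V, and the path score (the mean of per-edge scores supplied by the caller) is a hypothetical placeholder for the scoring of Section IV-F. The names `flow_paths` and `critical_edges` are illustrative.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (source entity, destination entity)


def flow_paths(edges: List[Edge], target: str) -> List[List[Edge]]:
    """Enumerate simple paths that end at `target`, walking edges backward."""
    incoming: Dict[str, List[str]] = {}
    for src, dst in edges:
        incoming.setdefault(dst, []).append(src)

    paths: List[List[Edge]] = []

    def walk(node: str, suffix: List[Edge], seen: Set[str]) -> None:
        preds = [p for p in incoming.get(node, []) if p not in seen]
        if not preds:                 # reached an origin (or closed a cycle)
            if suffix:
                paths.append(suffix)
            return
        for p in preds:
            walk(p, [(p, node)] + suffix, seen | {p})

    walk(target, [], {target})
    return paths


def critical_edges(edges: List[Edge], edge_score: Dict[Edge, float],
                   target: str, threshold: float) -> Set[Edge]:
    """Keep edges lying on at least one path whose mean score >= threshold."""
    kept: Set[Edge] = set()
    for path in flow_paths(edges, target):
        score = sum(edge_score[e] for e in path) / len(path)  # placeholder score
        if score >= threshold:
            kept.update(path)
    return kept


edges = [("wget", "a.py"), ("a.py", "python"), ("cron", "log"), ("log", "python")]
scores = {edges[0]: 0.9, edges[1]: 0.8, edges[2]: 0.1, edges[3]: 0.3}
result = critical_edges(edges, scores, target="python", threshold=0.5)
print(sorted(result))
```

In this toy graph, the path through `wget` and `a.py` is retained and the low-scoring path through `cron` and `log` is filtered out, which mirrors the role of the threshold $T$.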

VIII Conclusion

We propose SParse, a system that processes streaming logs and outputs the critical (attack-related) events for an alert in real time. Specifically, SParse constructs a suspicious semantic graph related to the POI event through the suspicious semantic transfer rule and storage strategy. SParse then uses a suspicious flow path extraction algorithm to extract all reachable flow paths from the suspicious semantic graph. Finally, SParse uses path-level contextual analysis to score all paths and filter irrelevant events to obtain the final critical component graph. Our evaluation on real attacks demonstrates that SParse achieves low false positives (FP = 99), low overhead (30MB for memory and 21.03MB for hard disk), and low latency (1.58s for attack investigation).

References

  • [1] “What twitter’s 200 million-user email leak actually means,” https://www.wired.com/story/twitter-leak-200-million-user-email-addresses/.
  • [2] “Mitre att&ck,” https://attack.mitre.org/.
  • [3] “System administration utilities,” https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/security_guide/chap-system_auditing.
  • [4] “About event tracing,” https://docs.microsoft.com/en-us/windows/win32/etw/about-event-tracing.
  • [5] A. Gehani and D. Tariq, “Spade: Support for provenance auditing in distributed environments,” in ACM/IFIP/USENIX Int. Middleware Conf., MIDDLEWARE.   Springer, 2012, pp. 101–120.
  • [6] S. Ma et al., “Kernel-supported cost-effective audit logging for causality tracking,” in USENIX ATC, 2018, pp. 241–254.
  • [7] A. Bates et al., “Trustworthy whole-system provenance for the linux kernel,” in USENIX, 2015, pp. 319–334.
  • [8] M. A. Inam et al., “Sok: History is a vast early warning system: Auditing the provenance of system intrusions,” in S&P.   IEEE, 2022, pp. 307–325.
  • [9] K. H. Lee et al., “High accuracy attack provenance via binary-based execution partition.” in NDSS, vol. 16, 2013.
  • [10] Y. Tang et al., “Nodemerge: Template based efficient data reduction for big-data causality analysis,” in CCS, 2018, pp. 1324–1337.
  • [11] Z. Xu et al., “High fidelity data reduction for big data security dependency analyses,” in CCS, 2016, pp. 504–516.
  • [12] S. M. Milajerdi et al., “Holmes: real-time apt detection through correlation of suspicious information flows,” in S&P.   IEEE, 2019, pp. 1137–1152.
  • [13] W. U. Hassan et al., “Tactical provenance analysis for endpoint detection and response systems,” in S&P.   IEEE, 2020, pp. 1172–1189.
  • [14] T. Zhu et al., “Aptshield: A stable, efficient and real-time apt detection system for linux hosts,” TDSC, 2023.
  • [15] W. U. Hassan et al., “Nodoze: Combatting threat alert fatigue with automated provenance triage,” in NDSS, 2019.
  • [16] W. U. Hassan et al., “This is why we can’t cache nice things: Lightning-fast threat hunting using suspicion-based hierarchical storage,” in ACSAC, 2020, pp. 165–178.
  • [17] Y. Liu et al., “Towards a timely causality analysis for enterprise security.” in NDSS, 2018.
  • [18] P. Fang et al., “Back-propagating system dependency impact for attack investigation,” in USENIX, 2022, pp. 2461–2478.
  • [19] A. Alsaheel et al., “Atlas: A sequence-based learning approach for attack investigation.” in USENIX, 2021, pp. 3005–3022.
  • [20] Z. Xu, P. Fang, C. Liu, X. Xiao, Y. Wen, and D. Meng, “Depcomm: Graph summarization on system audit logs for attack investigation,” in 2022 IEEE Symposium on Security and Privacy (SP).   IEEE, 2022, pp. 540–557.
  • [21] J. Zeng et al., “Watson: Abstracting behaviors from audit logs via aggregation of contextual semantics.” in NDSS, 2021.
  • [22] W. U. Hassan et al., “Tactical provenance analysis for endpoint detection and response systems,” in S&P.   IEEE, 2020, pp. 1172–1189.
  • [23] C. Xiong et al., “Conan: A practical real-time apt detection system with high accuracy and efficiency,” TDSC, vol. 19, no. 1, pp. 551–565, 2020.
  • [24] “Lateral movement,” https://www.crowdstrike.com/cybersecurity-101/lateral-movement/.
  • [25] “Apt notes,” https://github.com/aptnotes/data/.
  • [26] Y. Kwon et al., “Mci: Modeling-based causality inference in audit logging for attack investigation.” in NDSS, vol. 2, 2018, p. 4.
  • [27] P. Gao et al., “AIQL: Enabling efficient attack investigation from system monitoring data,” in USENIX, 2018, pp. 113–126.
  • [28] S. Ma et al., “Protracer: Towards practical provenance tracing by alternating between logging and tainting.” in NDSS, vol. 2, 2016, p. 4.
  • [29] S. T. King and P. M. Chen, “Backtracking intrusions,” in SOSP, 2003, pp. 223–236.
  • [30] “Darpa.” https://www.darpa.mil/program/transparent-computing.
  • [31] “Darap3 transparent engagement 3,” 2023, https://drive.google.com/drive/folders/1QlbUFWAGq3Hpl8wVdzOdIoZLFxkII4EK.
  • [32] T. Zhu et al., “General, efficient, and real-time data compaction strategy for apt forensic analysis,” TIFS, vol. 16, pp. 3312–3325, 2021.
  • [33] P. Gao et al., “Enabling efficient cyber threat hunting with cyber threat intelligence,” in ICDE.   IEEE, 2021, pp. 193–204.
  • [34] P. Gao et al., “SAQL: A stream-based query system for real-time abnormal system behavior detection,” in USENIX, 2018, pp. 639–656.
  • [35] D. Wagner and P. Soto, “Mimicry attacks on host-based intrusion detection systems,” in CCS, 2002, pp. 255–264.
  • [36] M. Bishop et al., Introduction to computer security.   Addison-Wesley Boston, 2005, vol. 50.
  • [37] C. Kruegel et al., Intrusion detection and correlation: challenges and solutions.   Springer Science & Business Media, 2004, vol. 14.
  • [38] “Insider threat monitoring software,” 2023, https://www.netwrix.com/insider_threat_detection.html.
  • [39] “Auditd,” 2023, https://linux.die.net/man/8/auditd.
  • [40] “Lttng,” 2023, https://lttng.org.
  • [41] “Sysdig,” 2023, https://github.com/draios/sysdig.
  • [42] “Redhat,” 2023, https://github.com/linux-audit/.
  • [43] W. U. Hassan et al., “This is why we can’t cache nice things: Lightning-fast threat hunting using suspicion-based hierarchical storage,” in ACSAC, 2020, pp. 165–178.
  • [44] Z. Xu et al., “High fidelity data reduction for big data security dependency analyses,” in CCS, 2016, pp. 504–516.
  • [45] “Exploit database,” Exploit Database, https://www.exploit-db.com/.
  • [46] “Cyber kill chain,” 2023, https://www.lockheedmartin.com/en-us/capabilities/cyber/cyber-kill-chain.html.
  • [47] K. Pei et al., “Hercule: Attack story reconstruction via community discovery on correlated log graph,” in ACSAC, 2016, pp. 583–595.
  • [48] M. N. Hossain et al., “Sleuth: Real-time attack scenario reconstruction from cots audit data.” in USENIX, 2017, pp. 487–504.
  • [49] T. Kim, X. Wang, N. Zeldovich, M. F. Kaashoek et al., “Intrusion recovery using selective re-execution.” in OSDI, 2010, pp. 89–104.
  • [50] M. N. Hossain et al., “Dependence-preserving data compaction for scalable forensic analysis,” in USENIX, 2018, pp. 1723–1740.
  • [51] N. Michael et al., “On the forensic validity of approximated audit logs,” in ACSAC, 2020, pp. 189–202.
  • [52] S. Wang et al., “Heterogeneous graph matching networks,” arXiv preprint arXiv:1910.08074, 2019.
  • [53] X. Han et al., “Unicorn: Runtime provenance-based detector for advanced persistent threats,” arXiv preprint arXiv:2001.01525, 2020.
  • [54] “Carbon black,” https://www.carbonblack.com/global-incident-response-threatreport/november-2018/.
  • [55] T. Chen, Q. Song, X. Qiu, T. Zhu, Z. Zhu, and M. Lv, “Kellect: a kernel-based efficient and lossless event log collector,” arXiv preprint arXiv:2207.11530, 2022.
  • [56] J. Byrnes et al., “A modern implementation of system call sequence based host-based intrusion detection systems,” in TPS-ISA.   IEEE, 2020, pp. 218–225.
  • [57] S.-Y. Wang et al., “Design and implementation of an intrusion detection system by using extended bpf in the linux kernel,” JNCA, vol. 198, p. 103283, 2022.
  • [58] C. Zhong et al., “Automate cybersecurity data triage by leveraging human analysts’ cognitive process,” in HPSC.   IEEE, 2016, pp. 357–363.
  • [59] V. D. Blondel et al., “Fast unfolding of communities in large networks,” Journal of statistical mechanics: theory and experiment, vol. 2008, no. 10, p. P10008, 2008.
  • [60] “Vpnfilter: New router malware with destructive capabilities,” 2018, https://symc.ly/2IPGGVE.
  • [61] “Ebay.” Ebay Inc. to ask Ebay users to change passwords, 2014, http://blog.ebay.com/ebay-inc-ask-ebay-users-change-passwords/.
  • [62] “Schneier security: Router vulnerability the vpnfilter botnet,” 2018, https://www.schneier.com/blog/archives/2018/06/router_vulnerab.html.

Appendix 1

Appendix 1-A Attack Cases

In this section, we show the ground truth for the 10 attack cases used for evaluation in Section V-A2. As shown in Figure 6 and Figure 7, for entities, we use rectangles ($\langle ProcessName,\ ProcessID\rangle$) for processes, ellipses ($\langle FileName\rangle$) for files, and diamonds ($\langle SrcIP : SrcPort\rangle$) for sockets. For events, we use solid lines with arrows, where the arrows indicate the flow of information. In addition, the number on the solid line indicates the relative time of the event. The solid red line indicates the POI event that triggered the alarm.

Appendix 1-A1 Attacks Based on Commonly Used Exploits

These 7 attacks were used in the evaluations of previous works [28, 44, 18, 26] and consist of the following scenarios:

  • Wget Executable [44]: Unsecured servers expose a vulnerability, allowing unauthorized users to fetch executable Python scripts through wget and execute them, as shown in Figure 6(a).

  • Illegal Storage [28]: Leveraging wget, a server administrator retrieves suspicious files and deposits them into a user’s home directory, as shown in Figure 6(b).

  • Illegal Storage 2 [28]: Leveraging curl, a server administrator retrieves suspicious files and deposits them into a user’s home directory, as shown in Figure 6(c).

  • Hide File [26]: With the intention of concealing a malicious file among user’s normal files, the attacker downloads a script and obfuscates it by altering the filename and location, as shown in Figure 6(d).

  • Steal Information [28]: The attacker steals user’s sensitive data and stores it in a covert file, avoiding detection, as shown in Figure 6(e).

  • Backdoor Download [28]: A malicious insider establishes a connection to a rogue server using the ping command. Subsequently, the insider downloads a concealed backdoor script and hides the script, as shown in Figure 6(f).

  • Annoying Server User [26]: A malicious user, gaining access to other users’ home directories, injects superfluous data into their files, as shown in Figure 6(g).

Appendix 1-A2 Multi-host Intrusive Attacks

In Attack 1, known as Shellshock Penetration, the attacker, following the initial exploit on Host 1, establishes a connection to cloud services (e.g., Dropbox, Twitter). Here, an image containing the C2 server’s IP address encoded in the EXIF metadata is downloaded. This tactic, reminiscent of advanced persistent threat (APT) attacks [60, 61], aims to evade network-based detection systems relying on DNS blacklisting. Leveraging the obtained IP address, the attacker proceeds to download malware from the C2 server to Host 1. Upon execution of the script, an examination of the ssh configuration file ensues, revealing reachable hosts in the network, including Host 2, Host 3, and Host 4. Subsequently, the malware fetches another script from the C2 server and disseminates it to the identified hosts, extracting passwords in the process, as shown in Figure 7(a).

In Attack 2, known as Data Leakage, the attacker, post-reconnaissance, acquires another malware, leak_data.sh, from the C2 server, distributing it to Host 2. This malware scans for concealed files and files containing sensitive strings, compressing them into a tarball named leak.tar.bz2. The compressed tarball is then transmitted back to Host 1, where it undergoes encryption before being uploaded to the internet, as shown in Figure 7(b).

Attack 3, known as VPN Filter [60], focuses on sustaining a direct connection to victim hosts from the C2 server. The attacker employs the notorious VPN Filter malware [62] to establish an initial breach on Host 1 and discover Host 2. The attacker then downloads the VPN Filter stage 1 malware from the C2 server to Host 1, subsequently transferring it to Host 2. This malware initiates the download of another executable from the C2 server, executing it to launch the attack and establish a connection with the C2 server. Through this established connection, the attacker transfers a malicious script to Host 2, aimed at gathering sensitive data on the compromised host, as shown in Figure 7(c).

Figure 6: The ground truth of 7 single-host attack cases. (a) Wget Executable. (b) Illegal Storage. (c) Illegal Storage 2. (d) Hide File. (e) Steal Information. (f) Backdoor Download. (g) Annoying Server User.
Figure 7: The ground truth of 3 multi-host attack cases. (a) Shellshock. (b) Dataleak. (c) VPN Filter.