SPARSE: Semantic Tracking and Path Analysis for Attack Investigation in Real-time

Jie Ying, Tiantian Zhu*, Wenrui Cheng, Qixuan Yuan, Mingjun Ma, Chunlin Xiong, Tieming Chen, Mingqi Lv, and Yan Chen, IEEE Fellow

Abstract

As the complexity and destructiveness of Advanced Persistent Threats (APTs) increase, there is a growing emphasis on identifying the series of actions an attacker undertakes to achieve their goal, a task called attack investigation. Currently, analysts construct a provenance graph and perform causality analysis on a Point-Of-Interest (POI) event to capture critical events (i.e., events related to the attack). However, due to the vast size of the provenance graph and the rarity of critical events, existing attack investigation methods suffer from high false positives, high overhead, and high latency.

To this end, we propose SParse, an efficient and real-time system for constructing critical component graphs (i.e., graphs consisting of critical events) from streaming logs. Our key observations are that 1) critical events exist in a suspicious semantic graph (SSG) composed of interaction flows between suspicious entities, and 2) information flows that accomplish the attacker’s goal exist in the form of paths. Therefore, SParse uses a two-stage framework to implement attack investigation (i.e., constructing the SSG and performing path-level contextual analysis). First, SParse operates in a state-based mode where events are consumed as streams, allowing easy access to the SSG related to the POI event through a semantic transfer rule and a storage strategy. Then, SParse identifies all suspicious flow paths (SFPs) related to the POI event from the SSG and quantifies the influence of each path to filter out irrelevant events. Our evaluation on a real large-scale attack dataset shows that SParse can generate a critical component graph ( $\sim$ 113 edges) in 1.6 seconds, which is 2014 $\times$ smaller than the backtracking graph ( $\sim$ 227,589 edges). SParse is 25 $\times$ more effective than other state-of-the-art techniques in filtering irrelevant edges.

Index Terms:

Advanced Persistent Threat, Intrusion/Anomaly Detection and Investigation, Data Provenance.

I Introduction

As the Internet has developed over time, APT attacks have grown more sophisticated and destructive. APT attacks target mainly large corporations such as Twitter [1], resulting in significant financial losses and reputational damage. In addition, APT attacks are executed in multiple stages, which include initial access, persistence, lateral movement, collection, and exfiltration [2].

While an intrusion may be noticed at any stage, detection only uncovers isolated traces of the attack. As a result, analysts must undertake causality analysis to capture the bigger picture and obtain a sound understanding of the detected attack point. Achieving a secure system recovery after a cyber attack requires certain key steps. First, analysts must determine how the adversary infiltrated the system. Once the point of entry is identified, then analysts need to assess the obvious and hidden damage done to the system, such as installed payload, modified files, and exfiltrated information. In short, analysts need to identify the sequence of critical events leading up to the POI event and reconstruct the critical component (subgraph consisting of critical events), which is also called attack investigation.

With the improvement of kernel-level monitoring frameworks [3, 4, 5], more and more causality analysis systems depend on a provenance graph consisting of entities (e.g., files, processes, and sockets) and inter-entity interactions (e.g., processes reading and writing files). However, the auditing framework is known to generate a large number of logs, up to several gigabytes per day on a single machine [6, 7], resulting in a massive graph with billions of edges. As a result, the critical events that make up the attack are drowned out by irrelevant events of normal behavior. Also, a provenance graph is a coarse-grained data format that cannot directly determine the specific dependencies between all relevant events of an entity (e.g., a process has multiple read-in events and write-out events) [8]. The dependency explosion problem [9, 10, 11] caused by these conditions leads to the poor performance of existing causality analysis systems.

TABLE I: Comparison table of related work on attack investigation performance. Column 5 (Storage of Historical Data) indicates whether to store the raw audit logs. The solidness of the marked circle reflects the degree: High (●), Medium (◑), Low (○).

Technique	System	False Positive Rate	False Negative Rate	Storage of Historical Data	Memory Overhead	Time Overhead

Label Propagation-based

HOLMES [12]

◑

●

○

●

○

RapSheet [13]

◑

●

APTSHIELD [14]

◑

●

○

Anomaly Score-based

NODOZE [15]

●

○

●

Swift [16]

●

○

●

PRIOTRACKER [17]

●

○

●

DEPIMPACT [18]

◑

○

●

Machine Learning-based

ATLAS [19]

○

◑

●

◑

DEPCOMM [20]

◑

●

◑

WATSON [21]

◑

●

◑

SParse

○

Methodologically, causality analysis can be classified into three categories: label propagation-based, anomaly score-based, and machine learning-based. Specifically, the label propagation-based approach [12, 22, 23, 14] sets entity labels and transformation rules through heuristic rules but suffers from a reliance on heavy manual effort and the incapacity to address zero-day vulnerabilities. The anomaly score-based approach [18, 15, 17, 16] quantifies the suspiciousness of dependency between entities, but faces challenges such as relying on historical statistics and the inability to adapt to complex enterprise production environments. The machine learning-based approach [19, 20, 21] employs neural networks to learn from attack samples but is hindered by insufficient sample size, poor generalization capability, and high computational overhead. These issues, as shown in Table I, make it challenging for analysts to conduct attack investigation within the optimal time (10 minutes) [24] while handling massive alerts. In summary, a general, efficient, and cost-effective causality analysis system needs to meet the following three requirements: 1) Reduced False Positives to address dependency explosion, 2) Affordable Overhead to reduce the cost of attack investigation, and 3) Minimal Latency to prevent further losses caused by subsequent attacks.

Key Insight. To meet the above requirements, after researching hundreds of APT attack descriptions [25] and analyzing numerous related dependency graphs [26, 27, 28, 17, 18, 19], we arrived at the following two key insights. Firstly, critical events must exist in the suspicious semantic graph (SSG) formed by interactions between suspicious entities. Specifically, we construct the SSG consisting of suspicious entities (e.g., a process visiting an unknown website) and suspicious events (data and control flows initiated by the suspicious entities). We believe that the SSG contains all critical events (i.e., critical events are a subset of suspicious events) and is much smaller than the subgraph obtained by backtracking [29] from the POI event, as shown in Figure 1. Secondly, information flows that accomplish goals exist in the form of paths. In other words, we argue that whether to filter an event cannot be decided in isolation; it calls for a comprehensive assessment of flow paths consisting of multiple events. Consequently, we construct all suspicious flow paths (SFPs) related to the POI event from the SSG and quantify the degree of influence based on the properties of the POI event and path structural characteristics to weed out irrelevant events. In summary, we achieve the attack investigation through a two-stage process (i.e., SSG construction and path-level contextual analysis).

This paper proposes SParse (short for Semantic tracking and Path Analysis foR attack inveStigation in real-timE) and makes the following contributions:

• We propose a state-based framework that contains a suspicious semantic transfer rule and a suspicious event storage strategy. The framework consumes events as streams with low overhead and without recording historical data. In addition, the framework can output the suspicious semantic graph related to the POI event in real time. The graph consists of all suspicious data flows and control flows that lead to the POI event, and thus contains all attack-related critical events. This is phase I of SParse, which filters semantic-irrelevant events.

• We propose a path-level contextual analysis mechanism that incorporates suspicious flow path extraction and scoring. It utilizes an optimized BFS algorithm to extract all suspicious flow paths (SFPs) from the SSG. Then the mechanism combines the properties of the POI event and characteristics of the path structure to quantify the impact of each SFP on the POI event. Finally, it filters all events that only exist in SFPs with low scores. This is phase II of SParse, which filters impact-irrelevant events.

• We implemented SParse and evaluated all its components in detail on a large-scale dataset with more than 150 million logs. Specifically, the dataset contains 10 simulated attacks [18] ( $\sim$ 100 million logs) and 5 attacks from the DARPA TC program [30, 31] ( $\sim$ 50 million logs). Experimental results show that SParse can generate the critical component graph ( $\sim$ 113 edges) in 1.6s, which is 2014 $\times$ smaller than the dependency graph ( $\sim$ 227,589 edges). The critical component graph (FP = 99) generated by SParse is 25 $\times$ more effective than other state-of-the-art causality analysis techniques (FP = 2,473) in filtering irrelevant edges while preserving the attack sequences. In addition, SParse can run for a long time while processing streaming logs with a low memory overhead (30MB).

II Background and Motivation

II-A Dependency Graph

Recent literature has leveraged the concept of data provenance, i.e., instead of manually piecing together individual evidence from raw logs, provenance-based systems can construct dependency graphs that explain the relationships between events, simplifying the attack investigation. Specifically, a dependency graph $G(E,\ V)$ is a heterogeneous graph consisting of nodes $V$ representing system entities and edges $E$ representing inter-entity events. The attributes of entities and events are carefully selected from raw audit logs so that they remain lean yet critical. For entities, we choose processes ( $\langle ProcessName,\ ProcessID\rangle$ ), files ( $\langle FileName\rangle$ ), and sockets ( $\langle IP\ :\ Port\rangle$ ). For events, we made selections as shown in Table II. Any edge $e\in E$ has the form $e=(u,v,t)$ , where $u$ represents the subject, $v$ represents the object, and $t$ represents the timestamp of the event. For two edges in the dependency graph, $e_{1}=(u_{1},v_{1},t_{1})$ and $e_{2}=(u_{2},v_{2},t_{2})$ , we consider that there is a dependence (causality) between $e_{1}$ and $e_{2}$ if $v_{1}=u_{2}$ and $t_{1}<t_{2}$ .
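
The dependence condition can be illustrated with a minimal Python sketch. The `Edge` type and the `depends` helper are hypothetical names introduced here for illustration and are not part of SParse; the timestamps are made up, while the entity names are taken from the Dataleak case.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    u: str    # subject (source of the data/control flow)
    v: str    # object (destination of the flow)
    t: float  # timestamp of the event


def depends(e1: Edge, e2: Edge) -> bool:
    """True if e2 causally depends on e1, i.e. v1 == u2 and t1 < t2."""
    return e1.v == e2.u and e1.t < e2.t


# gpg writes the file "leaked", which ssh later reads (illustrative times).
write_leaked = Edge("gpg", "leaked", 1.0)
read_leaked = Edge("leaked", "ssh", 2.0)
print(depends(write_leaked, read_leaked))  # True
print(depends(read_leaked, write_leaked))  # False
```

Applying this test repeatedly, backward from the POI event, is what makes the backtracking graph grow so quickly: every earlier edge that ends at a visited subject becomes a candidate.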

TABLE II: Attributes of system events

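Event	Operations	Attributes
Process Event	execve, clone	Time Stamp, Subject Name, Object Name, Data Amount
File Event	write, read, readv, writev	Time Stamp, Subject Name, Object Name, Data Amount
Network Event	sendto, recvfrom	Time Stamp, Subject Name, Object Name, Data Amount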

Figure 1: Partial dependency graph of one attack case, Dataleak. The black dashed box indicates the backtracking graph ( $\sim$ 200,000 edges) constructed from the POI event via backward propagation. The blue dashed box indicates the suspicious semantic graph ( $\sim$ 27 edges) constructed by SParse. The red dashed box indicates the critical component graph ( $\sim$ 22 edges) exported by SParse.

II-B Attack Investigation

The goal of attack investigation using dependency graphs [12, 15, 13, 16, 20, 18, 14, 23, 32] is to identify all critical events and critical components related to the POI event. A critical component is a subgraph of the dependency graph that retains internal information critical to the attack investigation and eliminates irrelevant system activity. Typically this analysis includes tracing the flow of data through the graph to identify potentially relevant events, and examining the properties of nodes and edges to identify signs of compromise. The goal of an attack investigation is to determine both the source and scope of the attack, ascertain the extent of damage or disruption, and develop remediation and prevention strategies.

II-C Motivating Example

Figure 1 shows a typical data leakage attack. The attacker exploited a vulnerability in apache2 and downloaded the malicious artifact gather.sh. After executing the malware, the attacker collected sensitive data from the target host and saved it as the file leaked.vm2. After using gpg to compress leaked.vm2 into the file leaked, the attacker transferred leaked to the C2 server 192.168.2.3:xx via the ssh process.

In Figure 1, the black dashed box denotes the backtracking graph obtained by performing backward causality analysis [29], which includes all events causal-related to the alert. The blue dashed box denotes the suspicious semantic graph obtained by using suspicious semantic transfer, which includes all events suspicious semantic-related to the alert. The red dashed box denotes the critical component graph obtained by analyzing the path-level contextual semantics, which includes all events attack-related to the alert.

Clearly, the number of attack-related critical events ( $\sim$ 22) is a drop in the ocean compared to the number of causally related non-critical events ( $\sim$ 200,000). This turns the attack investigation into a needle-in-a-haystack process, making it challenging for analysts to complete the investigation within the optimal time (600s) [24]. However, existing techniques such as DEPIMPACT [18] exhibit poor performance, as shown in Section V. DEPIMPACT requires 6,464s ( $\gg$ 600s) to generate dependency graphs with 2,473 false positives on average. In addition, it needs to load raw audit logs, resulting in unbounded memory overhead. Therefore, we need an attack investigation system with low false positives, low latency, and low overhead.

III Overview

III-A Threat Model

First, we assume that the event logs and digital signatures are trustworthy, similar to previous work [33, 34, 27, 15, 29, 17, 13, 18]. In addition, we assume that no attack-related events occurred before log processing began.

Second, we assume that the attacker is external to the system and carries out the attack remotely. This may involve exploiting vulnerabilities within the system or employing social engineering tactics to convince a user to download and run a file containing malicious code. Therefore, we do not support side-channel attacks or insider attacks, in which the attacker has a legitimate way to access the machine without going through such remote channels.

Third, we exclude mimicry attacks [35] from consideration in our threat model. These attacks are designed to evade intrusion detection systems by creating a seemingly benign chain of events within an enterprise environment. Existing intrusion detection systems [36, 37, 38] often rely on heuristics or analysis of individual events, making them vulnerable to such attacks. While detecting mimicry attacks is a limitation of current detection systems, it falls outside the scope of our work. Our focus is on identifying relevant events of alert generated by the detection system as contextual information to investigate the attack.

III-B Our Approach

In this section, we describe the architecture of SParse shown in Figure 2. Given a POI event, SParse can automatically identify the critical component of the dependency graph. SParse consists of two phases: (I) suspicious semantic graph construction (SSGC) and (II) path-level contextual analysis (PCA).

In Phase I, SParse makes use of mature auditing systems [4, 39, 40, 41, 42] to access kernel-level streaming logs and process them into specific data structures. Then SParse proposes a suspicious semantic transfer rule and storage strategy to maintain the suspicious entity list and related event table with low memory overhead. Given a POI event, SParse can construct the suspicious semantic graph (SSG) in real-time.

In Phase II, SParse first performs edge compaction on the suspicious semantic graph. Then SParse proposes a suspicious flow path extraction algorithm to identify possible propagation paths of the data/control flow in the suspicious semantic graph (i.e., suspicious flow paths). Next, SParse performs path-level contextual analysis, scores each suspicious flow path, and determines how relevant the path is to the POI event. Finally, SParse filters out all events that only exist in irrelevant paths from suspicious semantic graph to generate the critical component graph (CCG) as the output.

IV System Design

In this section, we describe the design details of each phase of SParse. As shown in Figure 2, SParse is a two-phase framework (i.e., constructing suspicious semantic graph and performing path-level contextual analysis) for mitigating the dependency explosion problem.

IV-A Goal and Key Insight

IV-A1 Suspicious Semantic Graph Construction

Goal. Given an alert point, current investigation techniques [15, 43, 17, 18, 28] store audit logs in memory (high overhead) and construct a backtracking graph [29] from the alert point. However, this graph usually includes numerous events that cannot contribute to an attack (high false positives), such as reads of read-only files and interactions with benign processes. Additionally, it takes time to identify related events by going through these logs (high latency). In summary, the backtracking graph gives rise to high memory overhead, high time overhead, and high false positives in existing approaches. Therefore, we aim to construct, in real time and with low memory overhead, a suspicious semantic graph that is smaller than the backtracking graph but contains all attack-related events.

Key Insight. To achieve the aforementioned goal, we rely on two key insights. (1) Suspicious semantics are introduced externally, i.e., the attack is carried out remotely, as defined in Section III-A. (2) Suspicious semantics propagate between entities, i.e., suspicious entities transmit suspicious semantics to non-suspicious entities via interaction.

Based on the above insights, we present a state-based framework to achieve the goal, which includes Section IV-B Streaming Log Monitoring and Section IV-C Suspicious Semantic Transfer.

IV-A2 Path-level Contextual Analysis

Goal. Once the POI event is identified, we construct the corresponding suspicious semantic graph. This suspicious semantic graph consists of all events semantically related to the POI event and contains all attack-related events (i.e., critical events). As shown in Section V, the SSG ( $\sim$ 417 edges) is 545 $\times$ smaller than the backtracking graph ( $\sim$ 227,589 edges) but 3.7 $\times$ larger than the critical component graph ( $\sim$ 113 edges). This suggests that there are still many false positives in the suspicious semantic graph. Therefore, we aim to filter out the events in the suspicious semantic graph that are contextually irrelevant to the POI event by performing path-level contextual analysis. By mitigating the dependency explosion problem a second time, we obtain a critical component graph to assist analysts in conducting attack investigation.

Key Insight. To achieve path-level contextual analysis, we rely primarily on the following two key insights: (1) Only by evaluating data/control flow paths as a whole can we determine whether they have an impact on the POI event. In other words, we cannot determine whether an event has impacted the POI event in isolation (i.e., at the event level) [15, 18], but only in context (i.e., at the path level). (2) Quantifying the degree of impact requires consideration of the properties and neighboring relationships of events.

Based on the above insights, we propose a path-level contextual analysis mechanism consisting of Section IV-D Edge Compaction, Section IV-E Suspicious Flow Path Extraction and Section IV-F Path-level Contextual Scoring.

IV-B Streaming Log Monitoring

SParse makes use of mature auditing systems [4, 39, 40, 41, 42] to access kernel-level logs and obtain the required data. At the entity level, SParse focuses on three entity types: file, process, and socket. To differentiate, SParse needs to construct unique identifiers for all entities. For the file, SParse records the absolute path as the unique identifier. For the process, SParse concatenates the PID and name as the unique identifier. For the socket, SParse constructs the 4-tuple (<srcip, srcport, dstip, dstport>) as the unique identifier. At the event level, SParse focuses on three event types: process interactions, file IO events, and network IO events. To the best of our knowledge, existing auditing systems are rich in semantics and meet the data requirements of SParse.
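
A minimal sketch of the three identifier schemes follows. The function names, the `pid:name` separator, and the use of a tuple for the socket identifier are assumptions made for illustration; the text only specifies which fields are combined.

```python
import posixpath


def file_uid(path: str, cwd: str = "/") -> str:
    """Absolute, normalized path used as the file identifier."""
    return posixpath.normpath(posixpath.join(cwd, path))


def process_uid(pid: int, name: str) -> str:
    """PID concatenated with the process name (separator is an assumption)."""
    return f"{pid}:{name}"


def socket_uid(srcip: str, srcport: int, dstip: str, dstport: int) -> tuple:
    """The 4-tuple <srcip, srcport, dstip, dstport>."""
    return (srcip, srcport, dstip, dstport)


print(file_uid("../etc/passwd", cwd="/home"))  # /etc/passwd
print(process_uid(1234, "ssh"))                # 1234:ssh
```

Including the name in the process identifier helps keep two processes apart when the operating system reuses a PID.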

TABLE III: Suspicious Semantic Transfer Rule.

Event Type	Subject	Object	Description
Recvfrom	Socket	Process	A process receives data from the network, the process becomes suspicious.
Sendto	Process	Socket	A suspicious process sends data to the network.
Read	File	Process	A process reads a suspicious file, the process becomes suspicious.
Write	Process	File	A suspicious process writes a file, the file is suspicious.
Execve/Clone	Process	Process	A process is started by a suspicious process, the process is suspicious.

IV-C Suspicious Semantic Transfer

The letter P in APT stands for Persistent, which means that an attacker can lurk for a long time until the goal is achieved. To support real-time investigation and long-term monitoring, SParse utilizes a state-based structure and a suspicious semantic transfer rule to record state changes and associated events for each entity. We next describe the specific data structure and transfer rule in turn.

IV-C1 Data Structure

For any entity $v\in V$ , SParse represents it as a triple $<U,\ T_{y},\ S>$ . $U$ is the unique identifier of the entity; its construction is described in Section IV-B. $T_{y}$ denotes the type of the entity and $S$ denotes the state of the entity. $S=0$ means that the entity is not suspicious, and $S=1$ means that the entity is suspicious. Note that a file or process has its $S$ initialized to 0 when it is created, whereas a socket has its $S$ initialized to 1 when it is created, i.e., by default we treat all sockets that are not in the whitelist as suspicious (Key Insight (1) in Section IV-A1).

For any event $e\in E$ , SParse represents it as a quintuple $<U_{s},U_{o},O,T_{i},D>$ . $U_{s}$ and $U_{o}$ are unique identifiers for the subject and object of $e$ , respectively. $O$ denotes the type of $e$ , $T_{i}$ denotes the time when $e$ occurred, and $D$ denotes the data flow amount of $e$ . Note that SParse is based on the direction of the data flow and control flow to determine the location of the subject and object. For example, when $O=Read$ , the data flow is from the file to the process, so the file is the subject and the process is the object. When $O=Write$ , the data flow is from the process to the file, so the process is the subject and the file is the object.
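
The entity triple and the flow-based choice of subject and object can be sketched as follows. The names `Entity`, `new_entity`, `orient`, and `INBOUND` are hypothetical, and the whitelist is passed in as a plain set; this is an illustration under those assumptions rather than the SParse implementation.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    uid: str    # U: unique identifier
    kind: str   # T_y: "file", "process", or "socket"
    state: int  # S: 0 = not suspicious, 1 = suspicious


def new_entity(uid: str, kind: str, whitelist=frozenset()) -> Entity:
    """Sockets start suspicious unless whitelisted; files and processes start clean."""
    state = 1 if kind == "socket" and uid not in whitelist else 0
    return Entity(uid, kind, state)


# Operations whose data flows from the target into the calling process.
INBOUND = {"read", "readv", "recvfrom"}


def orient(process_uid: str, target_uid: str, op: str) -> tuple:
    """Return (subject, object) following the direction of the flow."""
    if op in INBOUND:
        return (target_uid, process_uid)
    return (process_uid, target_uid)


print(orient("1234:cat", "/etc/passwd", "read"))   # ('/etc/passwd', '1234:cat')
print(orient("1234:cat", "/tmp/out", "write"))     # ('1234:cat', '/tmp/out')
```

Orienting every event by flow direction means a single check on the subject is enough to decide whether suspicion can propagate.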

IV-C2 Transfer Rule

Based on the idea of semantic transfer (Key Insight (2) in Section IV-A1), SParse constructs a set of predefined rules to process streaming logs and identify entity states in real-time. As shown in Table III, each rule is a quadruple: $<O,\ T_{s},\ T_{o},\ D>$ . $O$ is the type of event, $T_{s}$ and $T_{o}$ are the entity types of the subject and object respectively, and $D$ is a description of the rule. From Table III we can see that the subject is able to transfer suspicious semantics to the object via a specific event, which is referred to as the “suspicious semantic transfer rule”. As shown in Figure 3, $T$ denotes the time step, red entities denote suspicious entities, and red straight arrows denote suspicious semantic transfer. When $T=3$ , a suspicious process ( $process\ A$ ) writes data to a file ( $file\ C$ ), which in turn carries the suspicious semantics. When $T=4$ , the suspicious file is read by another process ( $process\ D$ ), which then carries the suspicious semantics. Conversely, if an entity has no suspicious semantics, any event involving this entity as a subject will not propagate suspicious semantics. For example, when $T=2$ , a file ( $file\ B$ ) is read by the suspicious entity ( $process\ A$ ), but there is no propagation of the suspicious semantics.

As shown in lines 5 to 11 in Algorithm 1, SParse processes the streaming logs, analyses data flows, and determines whether the entity state transitions. First, SParse accesses the event $e=<U_{s},U_{o},O,T_{i},D>$ and constructs the subject and object $u$ , $v$ corresponding to that event. Then, SParse determines whether the subject $u$ is a socket or exists in suspicious entity list (SEL, see Section IV-C3 for detailed definition). Finally, as soon as one of these two conditions is met, SParse will mark the object $v$ as suspicious and add it to SEL.

Algorithm 1 Suspicious Semantic Graph Construction

Input:
1: (1) Streaming logs in chronological order;
2: (2) Suspicious Entity List (SEL);
3: (3) Related Event Table (RET);
4: (4) POI event $p$ ;
Output: Suspicious semantic graph for POI event $p$ ;
5: for $e\in Streaming\ logs$ do
6:     Construct $u,v$ from $e$ where $u_{U}=e_{U_{s}},v_{U}=e_{U_{o}}$
7:     if $u_{S}==0$ and $\nexists\ u_{U}\in SEL$ then
8:         continue;
9:     else
10:         $v_{S}=1$ ;
11:         if $\nexists\ v_{U}\in SEL$ then
12:             SEL.append( $v_{U}$ );
13:         end if
14:         Add { $v_{U}:$ RET[ $u_{U}$ ] + e} to RET;
15:     end if
16:     if $\exists\ p_{U_{o}}\in SEL$ then
17:
18:         return graphConstruct(RET[ $p_{U_{o}}$ ])
19:     end if
20: end for

Algorithm 2 Suspicious Flow Path Extraction

Input:
1: (1) Suspicious Semantic Graph $G$ ;
2: (2) POI Event $p$ ;
3: (3) $Q$ and $V$ for the queue structure, $T$ for the tree structure;
Output: Suspicious flow paths;
4: $Q.add(p)$
5: $T.creatNode(p)$
6: while $Q.num\neq 0$ do
7:     $e=Q.pop()$
8:     $V.add(e)$
9:     for $ie\ \in\ G.inEdges(e_{U_{s}})$ do
10:         if $ie_{T_{i}}<e_{T_{i}}$ and $ie\notin Q$ and $ie\notin V$ then
11:             $Q.add(ie)$
12:             $T.creatNode(ie)$
13:             $T.creatEdge(e,\ ie)$
14:         end if
15:     end for
16: end while
17: return $T.allPaths()$

IV-C3 Storage Strategy

SParse designs two data structures to enable efficient storage of relevant data and real-time construction of the suspicious semantic graph. Specifically, SParse designs a Suspicious Entity List (SEL) and a Related Event Table (RET), as defined below.

Suspicious Entity List: A list that maintains all entities with suspicious semantics (possibly related to attacks). As shown in Figure 3, when $T=1$ , the data flow passes from the suspicious socket to process $A$ (suspicious semantic transfer), so SParse adds entity $A$ to SEL. When $T=2$ , there is no suspicious semantic transfer, so SEL is not changed. When $T=5$ , the data flow passes from suspicious file $C$ to process $A$ , but entity $A$ is already in SEL, so SEL is not changed.

Related Event Table: A table that holds the related events of every suspicious entity. The related events of a suspicious entity are the set of all data flows and control flows that lead to this entity’s semantic change. Specifically, SParse maintains a separate set of related events in RET for each suspicious entity. Whenever an event $e=<U_{s},\ U_{o},\ O,\ T_{i},\ D>$ that satisfies the suspicious semantic transfer rule occurs, SParse stitches the related events of $U_{s}$ with event $e$ and uses the result as the related events of $U_{o}$ to update RET. As shown in Section V-C, the RET is much smaller than the raw audit logs and the time needed to read it in is negligible. SParse can therefore construct a suspicious semantic graph related to the POI event in real-time.

As shown in Figure 3, when $T=1$ , SParse adds to RET with $A:\ \{1\}$ , indicating that the related event of the suspicious entity $A$ is $\{1\}$ . When $T=4$ , SParse adds to RET with $D:\ \{1,3,4\}$ , stitched from the related events $\{1,3\}$ of subject $C$ and the current event $\{4\}$ . When $T=5$ , SParse updates RET with $A:\ \{1,3,5\}$ , stitched from the related events $\{1,3\}$ of subject $C$ and the current event $\{5\}$ .

In order to speed up the consumption of log streams, SParse keeps the whole SEL in memory to determine the entity states and save suspicious entities in real-time. In contrast, inspired by the CPU architecture, SParse keeps only some of the high-modification (frequent growth in a short period) RETs in memory and stores other low-modification RETs in the hard disk. According to our experimental results (see Section V-C for detail), the memory overhead of SParse is 30MB on average, and there is no problem with high memory overhead. Note that we default sockets to suspicious entities, so only entities of file type and process type are saved in SEL.

In summary, SParse uses these two data structures to enable efficient storage of the necessary data and real-time construction of the suspicious semantic graph. As shown in lines 11 to 17 of Algorithm 1, SParse adds the object $v_{U}$ to the SEL for any event $e=<U_{s},\ U_{o},\ O,\ T_{i},\ D>$ that satisfies the semantic transfer rule. In addition, SParse stitches the related events of $U_{s}$ with event $e$ and uses the result as the related events of $U_{o}$ to update RET. Finally, for any given POI event, SParse is able to extract all relevant events for the object of that POI event from the RET in real-time. SParse then uses a simple graph construction algorithm, which extracts entities from these events as nodes and uses the events as edges, to construct a suspicious semantic graph associated with the POI event.
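
The SEL and RET updates can be sketched in Python and replayed on the five time steps described for Figure 3. This is a simplified illustration, not the SParse implementation: the `sock:`/`proc:`/`file:` prefixes, the `consume` and `ssg_edges` helpers, and the choice to merge new related events with an object's earlier ones (set union) are assumptions, and the whitelist and the memory/disk split of RET are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    subj: str        # U_s: source of the data/control flow
    obj: str         # U_o: destination of the flow
    op: str          # O: event type
    time: int        # T_i: occurrence time
    amount: int = 0  # D: data flow amount


def is_socket(uid: str) -> bool:
    # Assumption: socket identifiers carry a "sock:" prefix in this sketch.
    return uid.startswith("sock:")


def consume(events, sel=None, ret=None):
    """Update SEL and RET from a chronologically ordered event stream."""
    sel = set() if sel is None else sel
    ret = {} if ret is None else ret
    for e in events:
        # Only a socket or an entity already in SEL transfers suspicion.
        if not (is_socket(e.subj) or e.subj in sel):
            continue
        if not is_socket(e.obj):
            sel.add(e.obj)  # sockets are suspicious by default, not stored
        # Stitch the subject's related events with e (union is an assumption).
        prior = ret.get(e.obj, frozenset())
        ret[e.obj] = prior | ret.get(e.subj, frozenset()) | {e}
    return sel, ret


def ssg_edges(ret, poi: Event):
    """Related events of the POI object, i.e. the edges of the SSG."""
    return sorted(ret.get(poi.obj, frozenset()), key=lambda e: e.time)


stream = [
    Event("sock:a", "proc:A", "recvfrom", 1),
    Event("file:B", "proc:A", "read", 2),   # file B is clean: no transfer
    Event("proc:A", "file:C", "write", 3),
    Event("file:C", "proc:D", "read", 4),
    Event("file:C", "proc:A", "read", 5),
]
sel, ret = consume(stream)
print(sorted(e.time for e in ret["proc:D"]))  # [1, 3, 4]
print(sorted(e.time for e in ret["proc:A"]))  # [1, 3, 5]
```

Because each entity carries its own related-event set, answering a POI query is a single table lookup instead of a backward traversal over raw logs.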

IV-D Edge Compaction

A suspicious semantic graph often contains multiple parallel edges between two nodes. This is because operating systems typically complete read/write tasks (e.g., file read/write) by proportionally allocating data to multiple system calls. Inspired by recent work for graph reduction [44], SParse merges the edges between two nodes if the time difference between them is less than a given threshold. We ultimately chose 10 seconds as it demonstrates reasonable results in terms of various system calls, such as file transfers and network connections.
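
One possible reading of the merging step is sketched below. The text fixes only the 10-second threshold; requiring the same operation type, keeping the first timestamp of a merged group, summing data amounts, and measuring the gap from the most recently merged event are all assumptions of this sketch, as is the name `compact_edges`.

```python
def compact_edges(events, threshold=10.0):
    """Merge parallel edges whose time gap is below `threshold` seconds.

    Each event is a tuple (subject, object, op, time, amount).
    """
    out = []
    open_group = {}  # (subject, object, op) -> [index in out, last merged time]
    for s, o, op, t, d in sorted(events, key=lambda e: e[3]):
        key = (s, o, op)
        group = open_group.get(key)
        if group is not None and t - group[1] < threshold:
            i = group[0]
            ms, mo, mop, mt, md = out[i]
            out[i] = (ms, mo, mop, mt, md + d)  # keep first time, add amounts
            group[1] = t
        else:
            out.append((s, o, op, t, d))
            open_group[key] = [len(out) - 1, t]
    return out


events = [
    ("proc:A", "file:C", "write", 0.0, 100),
    ("proc:A", "file:C", "write", 4.0, 50),
    ("file:C", "proc:D", "read", 5.0, 150),
    ("proc:A", "file:C", "write", 30.0, 10),
]
for edge in compact_edges(events):
    print(edge)
# ('proc:A', 'file:C', 'write', 0.0, 150)
# ('file:C', 'proc:D', 'read', 5.0, 150)
# ('proc:A', 'file:C', 'write', 30.0, 10)
```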

IV-E Suspicious Flow Path Extraction

In order to perform path-level contextual analysis, it is first necessary to identify possible propagation paths of the data/control flow in the suspicious semantic graph (i.e., suspicious flow paths). SParse proposes a suspicious flow path extraction algorithm that can efficiently handle complex graph structures. In brief, as shown in Figure 4, SParse transforms the suspicious semantic graph into a multiway tree and then traverses it to obtain all suspicious flow paths.

Specifically, as shown in lines 1 to 5 of Algorithm 2, $Q$ and $V$ are queue structures, where $Q$ holds the events to be traversed and $V$ holds the events that have been traversed. $T$ is the multiway tree structure, which holds the topological information. As shown in lines 6 to 9 of Algorithm 2, SParse traverses event $e$ and identifies all of its incoming edges $ies$ ( $ies=G.inEdges(e_{U_{s}})$ ). As shown in lines 10 to 13 of Algorithm 2, if an incoming edge $ie$ ( $ie\in ies$ ) occurred earlier than event $e$ and has not been traversed ( $ie\notin V$ ), SParse creates node $ie$ in the multiway tree $T$ with $e$ as its parent. Finally, SParse traverses the multiway tree $T$ to obtain all paths from the root node to the leaf nodes, which are output as suspicious flow paths.

The suspicious flow path extraction algorithm takes into account the timeliness and directionality of the data/control flow and is able to handle the complex graph structure efficiently, as demonstrated in Section V-C, where SParse extracts over 140 suspicious flow paths in one second on average. Finally, it is important to note that events exist as nodes in the multiway tree and suspicious flow paths, as shown in Figure 4.
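
The extraction procedure can be sketched with a plain breadth-first traversal. This is not the optimized BFS used by SParse; `extract_paths`, the tuple layout of events, and the example graph are assumptions made for illustration. Events are the tree nodes, as in the description above.

```python
from collections import deque


def extract_paths(in_edges, poi):
    """Return root-to-leaf event-id paths of the multiway tree rooted at poi.

    in_edges maps an entity id to the events whose object is that entity;
    each event is a tuple (event_id, subject, object, time).
    """
    queue = deque([poi])
    seen = {poi[0]}          # events already queued or visited (Q and V)
    children = {poi[0]: []}  # multiway tree T: event id -> child event ids
    while queue:
        e = queue.popleft()
        for ie in in_edges.get(e[1], []):  # edges entering e's subject
            if ie[3] < e[3] and ie[0] not in seen:
                seen.add(ie[0])
                queue.append(ie)
                children[ie[0]] = []
                children[e[0]].append(ie[0])
    paths, stack = [], [[poi[0]]]
    while stack:  # enumerate all root-to-leaf paths
        path = stack.pop()
        kids = children[path[-1]]
        if kids:
            stack.extend(path + [k] for k in kids)
        else:
            paths.append(path)
    return paths


# Hypothetical SSG: two sockets feed process A, which writes file C,
# which process D then reads (the POI event, id 4).
e0 = (0, "sock:b", "proc:A", 0)
e1 = (1, "sock:a", "proc:A", 1)
e3 = (3, "proc:A", "file:C", 3)
e4 = (4, "file:C", "proc:D", 4)
e5 = (5, "file:C", "proc:A", 5)
in_edges = {"proc:A": [e0, e1, e5], "file:C": [e3], "proc:D": [e4]}
print(sorted(extract_paths(in_edges, e4)))  # [[4, 3, 0], [4, 3, 1]]
```

Event 5 is excluded because it occurred after event 3, so it cannot have influenced the write to file C; this is the timeliness constraint at work.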

IV-F Path-level Contextual Scoring

After extracting the suspicious flow paths, SParse needs to perform contextual analysis at the path-level to quantify the degree of influence of the entire path on the POI event (Key Insight (1) in Section IV-A2). Furthermore, the degree of impact between events is determined by the event attributes and the neighboring relationships between events (Key Insight (2) in Section IV-A2).

For each suspicious flow path $p$ , SParse calculates the $PathScore$ using the following equation:

$$PathScore=\frac{\sum_{e\in E}EventScore(e)}{Len(p)}\tag{1}$$

where $e$ denotes an event and $E$ denotes the set of all events contained in the path $(e\in E)$ . $EventScore$ denotes the degree of impact of event $e$ on the parent node, as defined later. $Len(p)$ denotes the number of events in the path and is used to normalize the $PathScore$ .

SParse calculates the $EventScore$ using the following equation:

$$EventScore(e)=\alpha\,\frac{Impact(e,f)}{\sum_{s\in child(f)}Impact(s,f)},\quad f=parent(e)\tag{2}$$

$$\alpha=1+\frac{len(child(f))-1}{C}\tag{3}$$

where $parent(e)$ denotes the parent of event $e$, and $child(f)$ denotes the set of all children of event $f$ in the multiway tree. $Impact(e,f)$ denotes the degree of impact that event $e$ exerts on event $f$, as defined later. $\alpha$ is an inflation factor that compensates for the drop in relative impact when a parent node has multiple children, since the normalization in Equation 2 splits the impact among siblings. As shown in Equation 3, $\alpha$ is controlled by the hyperparameter $C$ and the number of child nodes: it is negatively correlated with $C$ and positively correlated with the number of child nodes.
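As a quick worked illustration (not taken from the evaluation), read the numerator of Equation 3 as the number of children minus one and take $C=5$, the default chosen in Section V-D. For a parent $f$ with three children of equal impact:

$$\alpha=1+\frac{3-1}{5}=1.4,\qquad EventScore(e)=1.4\times\frac{1}{3}\approx 0.47$$

Without inflation each sibling would score $1/3\approx 0.33$, whereas an event that is its parent's only child gets $\alpha=1$ and $EventScore=1$. Under this reading, the inflation factor narrows, but does not close, the gap between events on branching and non-branching paths.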

SParse picks two features (i.e., data flow amount and time), to calculate the $Impact$ using the following equation:

$$Impact(e1,e2)=CS(Nor(e1_{D},e1_{T}),\ Nor(e2_{D},e2_{T}))\tag{4}$$

where $e1$ and $e2$ are two events and $e2$ is the parent node of $e1$ (i.e., $e2=parent(e1)$). $e_{D}$ and $e_{T}$ denote the data flow amount and occurrence time of event $e$, respectively. $Nor(\cdot)$ denotes normalization, which removes the difference in scale between $e_{D}$ and $e_{T}$. $CS(\cdot)$ denotes cosine similarity.

Intuitively, we assume that if the data flow amount between parent node and child node is similar, then there is a causal relation between them (e.g., a process reads 526 bytes from the network and then immediately writes 526 bytes to a file, which may be the same content). Similarly, if the timestamps are similar, then there is a causal relation between the events since we think that the exploitation is automated and its steps quickly follow each other.
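A minimal sketch of the impact computation, under stated assumptions: the text does not specify the form of $Nor(\cdot)$, so min-max scaling over known value ranges is assumed here, and the function names and the `(data_bytes, timestamp)` tuple layout are hypothetical.

```python
import math

def min_max(value, lo, hi):
    """Scale value into [0, 1]; assumed normalization, not specified by the paper."""
    return 0.0 if hi == lo else (value - lo) / (hi - lo)

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 0.0 if norm == 0 else dot / norm

def impact(child, parent, data_range, time_range):
    """Impact of a child event on its parent.

    child, parent: (data_bytes, timestamp) tuples.
    data_range, time_range: (lo, hi) bounds used for normalization.
    """
    u = (min_max(child[0], *data_range), min_max(child[1], *time_range))
    v = (min_max(parent[0], *data_range), min_max(parent[1], *time_range))
    return cosine_similarity(u, v)
```

For instance, a 526-byte read followed half a second later by a 526-byte write yields an impact close to 1, while a small transfer long after a large one yields a value close to 0.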

SParse will iteratively calculate the scores of all suspicious flow paths and consider the path whose score is below a threshold $T$ as an $irrelevant\ path$ . Then, SParse filters out events that only exist in the $irrelevant\ path$ and outputs the retained part as a critical component graph to help analysts in attack investigation.
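The scoring and filtering steps can be sketched as follows. This is an illustrative reading, not the authors' code: `impact` is a caller-supplied function, the numerator of Equation 3 is read as the number of children minus one, and the root (the POI event, which has no parent) is assumed to be excluded from both the sum and the path length.

```python
def score_paths(paths, parent, children, impact, C=5.0):
    """Return the PathScore of each root-to-leaf path (root listed first)."""
    def event_score(e):
        f = parent[e]
        total = sum(impact(s, f) for s in children[f])
        if total == 0:
            return 0.0
        alpha = 1.0 + (len(children[f]) - 1) / C   # inflation factor
        return alpha * impact(e, f) / total

    scores = []
    for path in paths:
        scored = path[1:]  # assumption: the root (POI event) is not scored
        scores.append(sum(map(event_score, scored)) / len(scored) if scored else 0.0)
    return scores

def critical_events(paths, scores, T=0.5):
    """Keep every event that lies on at least one path scoring at least T."""
    kept = set()
    for path, score in zip(paths, scores):
        if score >= T:
            kept.update(path)
    return kept
```

An event shared by a relevant and an irrelevant path is retained, which matches the rule that only events appearing exclusively on irrelevant paths are filtered out.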

V Evaluation

In this section, we first present the evaluation preparation, including the characteristics of the dataset, the obtaining of ground truth, and the setting of evaluation metrics. We then evaluate the effectiveness and efficiency of each component separately. In summary, we aim to answer the following questions:

• RQ1: How effective is SParse in attack investigation?

• RQ2: How efficient is SParse in attack investigation?

• RQ3: How sensitive is SParse in parameter selection?

V-A Evaluation Preparation

We deploy our implementation of SParse on a computer with an Intel(R) Core(TM) i9-10900K CPU @ 3.70GHz and 64GB of memory. SParse processes streaming logs from the auditing systems Sysdig [41] and SPADE [5], extracts information in the format described in Section II-A, and runs continuously with low overhead.

V-A1 Attack Dataset

We evaluate the effectiveness of SParse in revealing attack sequences on a dataset with over 150 million system audit logs. As shown in Table IV, this dataset contains 15 attack cases (10 simulated attacks and 5 DARPA attacks) and is provided by DEPIMPACT [18]. The simulated attacks consist of 7 single-host attacks (rows 2 to 8) based on common exploits [45, 28, 44, 26] and 3 multi-host attacks (rows 9 to 11) based on the Cyber Kill Chain [46] and CVE reports [45]. The simulated attacks were run on deployed hosts with 12 active users and hundreds of processes, on which daily tasks such as file manipulation, text editing, and software development were carried out to simulate real-world usage. We detail these 10 simulated attacks in Appendix 1-A1 and Appendix 1-A2. The DARPA dataset contains 5 host attacks (rows 12 to 16), which were conducted by two teams (FiveDirections and Theia) and differ in terms of target systems (Windows, Linux) and vulnerability exploits (pine backdoor, firefox backdoor, and browser extension).

Table IV shows the statistics of the generated dependency graphs for all attacks. Column “Attack” indicates the name of the attack case. Columns “# V” and “# E” indicate the number of nodes and edges of the backtracking graphs after performing causality analysis [29] from POI events. Column “# CE” shows the number of critical events (related to the attack), which we explain in detail below.

TABLE IV: The statistics of dependency graphs generated for all the 15 attacks.

Attack	# V	# E	# CE
Wget Executable	78	349	16
Illegal Storage	2,277	34,367	7
Illegal Storage2	9,345	290,933	7
Hide File	23,110	459,514	10
Steal Information	23,153	495,570	7
Backdoor Download	1,411	12,354	12
Annoying Server User	114	585	15
Shellshock	1,706	42,918	36
Dataleak	1,863	20,807	25
VPN Filter	2,436	39,332	29
Five Dir Case 1	259	473	8
Five Dir Case 3	6,109	83,154	9
Theia Case 1	175,196	794,341	8
Theia Case 3	281,001	1,137,829	8
Theia Case 5	245	1,309	5
Avg	35,220.20	227,589.00	13.47

TABLE V: Performance of dependency graphs generated by different techniques. SSGC and PCA are the components of SParse.

Attack	SLEUTH			NODOZE			DEPIMPACT			SSGC			SSGC+PCA
Attack	FP	FN	# E	FP	FN	# E	FP	FN	# E	FP	FN	# E	FP	FN	# E	# SFP
Wget Executable	68	7	77	78	0	94	32	0	48	3	0	19	1	0	17	5
Illegal Storage	2189	5	2191	5686	1	5694	2625	0	2632	172	0	179	47	0	54	85
Illegal Storage2	2072	3	2076	11959	1	11967	1255	0	1262	986	0	993	252	0	259	522
Hide File	4558	5	178703	38919	2	38931	14982	0	14992	1590	0	1600	356	0	366	833
Steal Information	3972	3	179202	21114	1	21122	15774	0	15781	1654	0	1661	453	0	460	868
Backdoor Download	1392	6	9198	232	1	245	2625	0	2637	42	0	54	36	0	48	20
Annoying Server User	2	13	281	88	2	105	39	0	54	3	0	18	1	0	16	6
Shellshock	672	9	4299	1007	4	1047	161	4	197	87	0	123	9	0	45	40
Dataleak	622	7	4279	597	7	629	44	3	69	16	0	47	68	0	83	17
VPN Filter	722	8	5342	310	5	344	189	2	218	86	0	115	72	0	101	37
Five Dir Case 1	97	4	237	291	1	300	17	0	25	8	0	16	3	0	11	11
Five Dir Case 3	277	5	37058	717	1	727	171	0	180	82	0	91	31	0	40	22
Theia Case 1	453	6	264107	92091	2	92101	689	1	697	494	0	502	98	0	106	374
Theia Case 3	231	4	493626	7369	1	7378	1074	1	1082	820	0	828	61	0	129	598
Theia Case 5	2	1	845	395	1	401	3	0	8	9	0	14	2	0	7	15
Avg FP/FN/# E	1154	5.73	1163	16682	2.00	16696	2473	0.73	2487	403	0.00	417	99	0.00	113	230
Avg FPR/FNR( $10^{-2}$ )	0.507	42.54	/	7.329	14.85	/	1.087	5.419	/	0.177	0	/	0.043	0	/	/

$\dagger$ SSGC: Suspicious Semantic Graph Construction. PCA: Path-level Contextual Analysis. SFP: Suspicious Flow Path.

V-A2 Obtaining Ground Truth

In order to evaluate the performance of SParse, we need to specify the ground truth for all attack cases (i.e., identify all critical events). Specifically, we analyzed the targets of each attack case and determined the corresponding POI events from massive logs. We then conducted back-propagation [29] from POI events to obtain backtracking graphs and searched for critical events within them. Finally, we manually ascertained the critical events based on Indicators of Compromise (e.g., file names and malware names) and attack steps (e.g., download then execution), as shown in Appendix 1-A.

Evaluation Metrics. First, we measure false positives (FP) and false negatives (FN). False positives are edges that SParse identifies as critical but that are not, while false negatives are edges that SParse identifies as irrelevant but that are critical. We then compute the false positive rate $FPR=FP/E_{total}$ and the false negative rate $FNR=FN/E_{c}$, where $E_{total}$ is the number of edges in the backtracking graph and $E_{c}$ is the number of critical edges.
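The two rates can be checked against the reported averages with a short calculation. The sketch below assumes, as the numbers suggest, that the average rates in Table V are computed from the average counts in Tables IV and V; the function name `rates` is illustrative.

```python
def rates(fp, fn, total_edges, critical_edges):
    """Return (FPR, FNR) as defined by FPR = FP / E_total and FNR = FN / E_c."""
    return fp / total_edges, fn / critical_edges

# Average counts taken from Table IV (# E, # CE) and Table V (FP, FN).
AVG_EDGES, AVG_CRITICAL = 227_589.0, 13.47

sparse_fpr, sparse_fnr = rates(99, 0.0, AVG_EDGES, AVG_CRITICAL)
sleuth_fpr, sleuth_fnr = rates(1154, 5.73, AVG_EDGES, AVG_CRITICAL)

print(f"SParse: FPR={sparse_fpr * 100:.3f}e-2, FNR={sparse_fnr * 100:.2f}e-2")
print(f"SLEUTH: FPR={sleuth_fpr * 100:.3f}e-2, FNR={sleuth_fnr * 100:.2f}e-2")
```

Expressed in units of $10^{-2}$, these values agree with the last row of Table V up to rounding.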

V-B RQ1: How effective is SParse in attack investigation?

There is a large body of graph-based work on attack investigation [12, 47, 48, 21, 44, 18, 15, 13]. However, HOLMES [12] and RapSheet [13] rely solely on predefined TTP-like (Tactics, Techniques, and Procedures) rules for detection and investigation. Such approaches rely heavily on manual effort and cannot effectively address zero-day vulnerabilities. HERCULE [47] and WATSON [21] address attack investigation problems by discovering communities (i.e., behavioral abstractions) on the provenance graph. Their purpose is to assist analysts in identifying attack stages from a community perspective, enabling a quick understanding of the purpose of a subgraph in the provenance graph (e.g., file compilation and uploading). HERCULE and WATSON (subgraph-level) differ in granularity from SParse (event-level). Hence, we do not compare our work with them.

Here, we compare the performance of SParse with 3 state-of-the-art approaches: SLEUTH [48], NODOZE [15], and DEPIMPACT [18], which are more relevant in terms of methodology (anomaly-score based) and granularity (event-level) for our evaluation. SLEUTH defines TTP-like rules with the added constraint that these rules only fire when certain confidentiality or integrity conditions are satisfied according to a tag-based information flow propagation. NODOZE measures the rarity of different events in the environment and, based on this, assigns anomaly scores to each event in the dependency graph. To meet the requirements of NODOZE, we use logs that contain only normal behavior (captured outside of attack periods) as execution profiles (i.e., statistics of events). DEPIMPACT assigns anomaly scores to edges based on a number of characteristics (including time, data flow amount, and node access), and then aggregates the scores to determine the entry points through a propagation algorithm. DEPIMPACT then outputs the events shared by the forward graph of the entry point and the backward graph of the alert point. Finally, we also perform an ablation experiment on SParse to evaluate the output of different phases.

Table V shows the performance of attack investigation for different techniques in all cases. A lower FP/FPR indicates a better ability to filter irrelevant edges, and a lower FN/FNR indicates a better ability to retain critical edges. The results show that SParse (SSGC + PCA) performs the best. On average, the critical component graph generated by SParse ($\sim$113 edges) is 8849$\times$ smaller than the original dependency graph ($\sim$1,000,000 edges) and 22$\times$ smaller than the second-best result (i.e., DEPIMPACT with $\sim$2,487 edges). SParse demonstrates the best capability in filtering irrelevant edges while preserving the attack sequences (FP = 99, FPR = 0.043 $\times 10^{-2}$), which is 25$\times$ more effective than DEPIMPACT (FP = 2,473, FPR = 1.087 $\times 10^{-2}$). Moreover, SParse does not miss any critical edges (i.e., FN = 0, FNR = 0). Note that the columns “SSGC” and “# SFP” report intermediate products of SParse: the output of suspicious semantic graph construction and the number of suspicious flow paths, respectively.

SLEUTH investigates attack scenarios by defining confidentiality and integrity labels for label propagation. However, SLEUTH cannot guarantee coverage of all attack-related edges and therefore performs the worst in FNR (FN = 5.73, FNR = 42.54 $\times 10^{-2}$), resulting in ineffective support for attack investigation. NODOZE performs better than SLEUTH in retaining critical edges but worse in FPR (FP = 16,682, FPR = 7.329 $\times 10^{-2}$; FN = 2, FNR = 14.85 $\times 10^{-2}$), so it does little to reduce the investigation cost for analysts. The reason is that the performance of NODOZE depends on whether the execution profile comprehensively covers all benign behaviors, and it ignores information in the form of streams. DEPIMPACT heuristically selects 3 entry points (one each for files, processes, and sockets), which amplifies the attack surface. DEPIMPACT directly takes the intersection between the forward graph of the entry points and the backward graph of the alert point as output, which introduces massive attack-irrelevant events (FP = 2,473, FPR = 1.087 $\times 10^{-2}$). Regarding the reduction results, SParse exhibits the best performance (FP = 99, FPR = 0.043 $\times 10^{-2}$). We attribute this to two factors: (1) SParse uses the insight of semantic transfer to construct a suspicious semantic graph related to the POI event, which inherently filters out a significant number of irrelevant events (FP = 403, FPR = 0.177 $\times 10^{-2}$), and (2) SParse evaluates the relevance of context to POI events at the path level rather than in isolation.

In summary, we believe that SParse can satisfy the requirement of low false positives in practice.

TABLE VI: Comparison of data consumption rate and data generation rate.

Attack Case	Generation Rate (events/s)	Consumption Rate (events/s)
Wget Executable	3,018	42,175
Illegal Storage	3,273	45,572
Illegal Storage2	3,156	34,727
Hide File	4,039	31,155
Steal Information	4,271	40,719
Backdoor Download	3,380	39,451
Annoying Server User	3,153	42,155
Shellshock	3,741	42,742
Dataleak	3,692	42,899
VPN Filter	3,560	39,611
Five Dir Case 1	917	40,358
Five Dir Case 3	1,161	44,795
Theia Case 1	1,298	32,097
Theia Case 3	1,215	34,856
Theia Case 5	966	41,303
Avg	2,722	39,641

TABLE VII: Overhead performance of each component of SParse and baseline approach.

Attack	SSGC			PCA				DEPIMPACT			
Attack	Mem. (MB)	CPU (%)	Disk (MB)	Time (s)	Mem. (MB)	CPU (%)	# SFP	Time (s)	Mem. (MB)	CPU (%)	Disk (MB)
Wget Executable	29.75	3.14	0.072	0.09	108.20	4.97	5	4.45	125.72	7.62	0.79
Illegal Storage	29.92	3.32	7.15	0.32	124.97	4.36	85	285.10	178.89	7.46	79.21
Illegal Storage2	31.81	3.35	69.21	2.38	125.16	4.27	522	985.55	304.51	7.29	169.21
Hide File	34.82	3.59	101.12	6.22	147.34	4.94	833	19,814.25	592.82	7.51	401.12
Steal Information	36.55	3.61	109.17	6.66	155.60	4.90	868	12,701.39	596.26	7.82	409.17
Backdoor Download	28.01	3.27	2.21	0.10	125.16	4.49	20	33.68	199.17	7.42	61.13
Annoying Server User	26.62	3.40	0.011	0.33	119.41	4.41	6	0.23	127.81	7.76	0.70
Shellshock	29.74	3.39	8.94	0.02	137.75	4.13	40	2.17	138.98	7.57	154.00
Dataleak	30.39	3.40	0.82	0.04	132.86	4.29	17	2.24	130.65	7.22	97.50
VPN Filter	31.01	3.26	0.82	0.06	146.08	4.27	37	15.46	135.49	7.43	537.95
Five Dir Case 1	28.83	3.29	7.15	0.04	124.28	4.56	11	2.08	124.12	7.67	0.82
Five Dir Case 3	30.05	3.50	37.71	0.11	135.04	4.29	22	1,674.90	189.08	7.72	88.65
Theia Case 1	31.97	3.43	59.50	2.34	137.35	4.15	374	24,761.90	844.28	7.64	479.20
Theia Case 3	35.08	3.62	68.45	4.85	143.90	4.80	598	36,682.87	885.83	7.79	592.23
Theia Case 5	29.12	3.30	0.13	0.19	129.21	4.67	15	4.91	138.60	7.15	1.41
Avg	30.91	3.39	21.03	1.58	132.82	4.50	230	6,464.75	320.81	7.54	204.87

$\dagger$ SSGC: Suspicious Semantic Graph Construction. PCA: Path-level Contextual Analysis. SFP: Suspicious Flow Path.

V-C RQ2: How efficient is SParse in attack investigation?

In this section, we evaluate the efficiency of SParse when deployed in a real scenario. First, we evaluate the real-time performance of SParse by comparing the data generation rate with the data consumption rate. Table VI shows that, on average, SParse consumes events about 15$\times$ faster than the host generates them (39,641 vs. 2,722 events per second), which shows that real-time operation is feasible. Then we evaluate the overheads of each component of SParse on time, memory, CPU, and disk, as shown in Table VII. In addition, we conduct comparative experiments with the baseline method DEPIMPACT.

Time and Memory. SParse performs path-level contextual analysis of the suspicious semantic graph and constructs the critical component graph in 1.58 s on average, which is 4091$\times$ faster than DEPIMPACT ($\sim$6,464 s). DEPIMPACT uses an impact propagation algorithm to identify entry points. However, when the number of edges in the backtracking graph is high (e.g., case “Hide File”), it takes an extremely long time to reach global convergence (19,814 s). In terms of memory overhead, three factors make SParse lighter than DEPIMPACT: (1) SParse only stores suspicious nodes in memory when processing streaming logs, so the memory overhead grows extremely slowly and can be regarded as constant (about 30 MB). (2) The suspicious semantic graphs ($\sim$417 edges, Table V) read in by SParse are much smaller than the dependency graphs ($\sim$one million edges) read in by DEPIMPACT. (3) The suspicious flow path extraction algorithm applied in PCA only traverses the graph structure once (a complexity of $O(E)$) and generates a small number of SFPs (230 on average). For these reasons, the memory overhead of PCA (132.82 MB) is 2.4$\times$ smaller than that of DEPIMPACT (320.81 MB). Finally, it is important to emphasize that suspicious semantic graph construction (SSGC) runs in real time on the log stream, so we do not perform a time overhead evaluation for this component.

CPU and Disk. SParse stores the relevant event table (RET) in the database when processing streaming logs. We therefore also measure the overhead of SParse on the hard disk. As shown in Table VII, SParse requires only 21.03 MB of disk space on average, which is more than 9$\times$ smaller than the original logs (204.87 MB). This is because SParse only retains events of suspicious semantic relevance, whereas other techniques [10, 15, 18, 28, 17] require the whole logs, which significantly increases the overhead on the disk. In addition, the CPU overhead of SParse is 4.5%, owing to its simple yet effective algorithm.

In summary, we believe that SParse outperforms the latest work DEPIMPACT in all aspects of overhead and can satisfy the requirements of low overhead and latency in application scenarios.

V-D RQ3: How sensitive is SParse in parameter selection?

As shown in Section IV-F, SParse has two hyperparameters that need to be set: (1) $C$, which controls the inflation factor, and (2) $T$, the threshold used to filter paths. As shown in Equation 3, the larger $C$ is, the smaller the inflation factor $\alpha$ and the lower the path score, so fewer events are retained by SParse. Parameter $T$, on the other hand, determines how strict SParse is in path selection: the larger $T$ is, the fewer events are retained by SParse.

We demonstrate the sensitivity of SParse in parameter selection by testing all combinations of the hyperparameters $C$ and $T$ through grid search. Specifically, we vary $C$ from 1 to 9 with a step size of 1 and $T$ from 0.1 to 0.9 with a step size of 0.1, and measure the performance of SParse on the metrics FP/FN, as shown in Figure 5. As parameter $C$ increases, SParse reduces the path score and retains fewer events, hence FP decreases. At the same time, SParse misses some critical events, causing FN to rise. As the threshold $T$ increases, SParse blocks more paths and preserves fewer events, so FP falls while FN rises. As shown in Figure 5(a), FP decreases gradually from the upper left to the lower right; as shown in Figure 5(b), FN increases gradually from the upper left to the lower right. The effects of these parameter changes on SParse are in line with our predictions.
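The shape of this grid and the direction of the trend can be illustrated with a toy calculation. The sketch below is not the evaluation harness: `retained` applies the inflation factor to made-up base scores for a single branching level, reading the numerator of Equation 3 as the number of children minus one, and all names and numbers are hypothetical.

```python
from itertools import product

def alpha(num_children, C):
    """Inflation factor, with the numerator read as (number of children - 1)."""
    return 1.0 + (num_children - 1) / C

C_values = range(1, 10)                      # C = 1, 2, ..., 9
T_values = [k / 10 for k in range(1, 10)]    # T = 0.1, 0.2, ..., 0.9
grid = list(product(C_values, T_values))     # 81 (C, T) combinations

def retained(base_scores, num_children, C, T):
    """Count toy paths whose inflated score is at least the threshold T."""
    return sum(1 for s in base_scores if alpha(num_children, C) * s >= T)

toy_scores = [0.05, 0.2, 0.35, 0.6]          # made-up uninflated scores
counts = {(C, T): retained(toy_scores, 3, C, T) for C, T in grid}
```

In this toy setting the count of retained paths is largest at the smallest $C$ and $T$ and shrinks towards the opposite corner of the grid, which is the same direction in which FP falls and FN rises.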

FP and FN are both metrics that we want to be as low as possible, but there is a trade-off between them (i.e., a fall in one leads to a rise in the other). We therefore choose $C$ = 5 and $T$ = 0.5 as the default parameters for SParse. Operators can of course adapt these parameters to their specific scenario.

VI Discussion

Cooperation with existing techniques. There is a requirement for defenders to be able to detect and handle real-world attacks in real time. As an attack investigation system, SParse can be combined with a variety of existing techniques to meet this goal. By working with intrusion detection systems [12, 14, 23, 49] that can provide real-time alerts and defenses, SParse is able to investigate alerts for relevant events and provide a brief critical component graph to analysts. By working with compression systems [44, 10, 50, 51, 32] that can reduce redundant information, SParse is able to reduce memory and disk overheads, enabling years of relevant data storage. By working with analysis systems [13, 43, 15] that automatically determine the authenticity of alarms, SParse is able to provide streamlined but sufficient relevant events to support the triage of alarms.

Evasion Attacks. Existing investigation techniques, such as DEPIMPACT, utilize weight computation and score-propagation techniques to identify attack entry points. However, calculating the weights of events independently does not fit the situation where information flows between entities. As a result, an attacker can inject a payload by writing multiple times and thus evade tracking. In contrast, SParse mitigates this by performing contextual analysis at the path level to assess the relevance between a path and an alert. An attacker may also try to evade investigation by taking a detour (i.e., repeating meaningless behavior), as in the attack case “Hide File”, where the attacker changes the file name multiple times. As shown in Section IV-F, SParse mitigates this problem through its relative score calculation and event score inflation mechanisms.

Limitation. To implement attack investigation, SParse relies on alerts initiated by Endpoint Detection and Response (EDR) placed on the host. SParse cannot perform attack investigation if the detection system fails to raise alerts (i.e., to identify suspicious behavior). Recent approaches [52, 53] propose solutions to improve the detection of anomalous system activity, and SParse can work with these approaches to provide better defenses. If the detection system raises false positives frequently, SParse can only identify relevant events but cannot filter these false alerts. However, SParse can work with alarm triage techniques [13, 15, 50] and help them filter false alarms by providing a streamlined critical component graph. In addition, as shown in Section IV-C3, SParse needs to maintain a suspicious entity list in memory and a related event table on disk. As the runtime grows, redundant information accumulates in the related event table, such as $Event\ 10$ being stored four times in Figure 3. By working with existing compression systems [44, 10, 50, 51, 32], SParse can effectively mitigate this situation and enable long deployment runs.

VII Related Work

Analysts need to perform threat alert validation and post-mortem analysis of incidents. While auditing is by no means the only form of forensic investigation, it is telling that 75% of incident response specialists consider logs to be the most valuable form of investigation artifact [54]. Several studies have focused on implementing efficient log collection, such as kellect [55] for Windows and eBPF [56, 57] for Linux, which is a prerequisite for attack investigation. Methodologies for attack investigation using logs can be divided into three categories: label propagation-based, anomaly score-based, and machine learning-based.

Label Propagation-based. Labels are given to nodes and are propagated to other nodes by system calls. When an alert arises, the analyst can easily retrace the events associated with the alert based on the label. Milajerdi et al. propose HOLMES [12], which mitigates the dependency explosion problem by requiring the aggregation of more labels to raise the detection threshold. RapSheet [13] makes use of the tactical provenance graph (TPG), which instead of encoding low-level system event dependencies, reasons about the causal relationships between threat alerts. RapSheet proposes a threat scoring scheme that evaluates the severity of each alert based on TPGs to enable effective investigation of alerts. Zhong et al. [58], on the other hand, mine the analysts’ security operation traces to learn label-propagation rules, and then use these rules to identify relevant paths automatically when an alert occurs. CONAN [23] iteratively performs malicious behavior determination with label passing and aggregation through data provided by ETW, enabling real-time detection and investigation. These label-based approaches rely on heuristic rules that cannot handle all types of attacks and have a high level of false negatives in attack investigation.

Anomaly Score-based. The essential idea of these methods is to quantify the suspiciousness of edges between pairs of nodes. Pei et al.’s HERCULE [47] system correlates multi-source heterogeneous logs to construct a multi-dimensional weighted graph and uses the unsupervised community detection algorithm Louvain [59] to discover attack-related paths from it. NODOZE [15] and PRIOTRACKER [17], on the other hand, compute statistics on historical data and assign anomaly scores to events in the dependency graph. A score propagation algorithm is then used to find suspicious events. Both of these methods rely on statistics of historical data and cannot be applied in complex and variable production environments. DEPIMPACT [18] calculates dependency weights globally based on multiple features (including time, data flow amount, and node access) and then aggregates the weights to nodes to determine suspicious points of intrusion. The overlapping parts between the forward graph of the entry point and the backward graph of the alarm point are then considered attack-related events. These score-based methods can cover all critical events. However, they place no restriction on the variation of scores to solve the dependency explosion problem, which leads to higher false positives. In addition, these methods require reading in all the logs to build the dependency graph, which is very expensive in terms of hard disk and memory.

Machine Learning-based. Some techniques use machine learning methods to learn contextual and structural information from dependency graphs to identify the most relevant abnormal events to alert. ATLAS [19] uses a novel combination of causal analysis, natural language processing, and machine learning to construct sequence-based models as a way to establish critical patterns of attack and non-attack behavior in the dependency graph. On the other hand, DEPCOMM [20] proposes a novel graph summarization method by dividing the large graph into process-centric subgraphs. DEPCOMM then extracts summaries from each subgraph, enabling the generation of summary graphs from dependency graphs, thereby reducing the difficulty of investigation for analysts. WATSON [21] automatically abstracts and clusters high-level system behavioral features from low-level audit events. WATSON performs a Depth-First Search on each object to summarise system behavior and then uses machine learning to infer the semantics of each audit event based on its context. Finally, behaviors with similar semantics are aggregated in the embedding space to identify similar events to the alert in order to investigate the attack. These machine learning-based approaches suffer from inadequate training samples, poor generalization capabilities, and high computational costs.

In contrast to previous work, SParse employs a hybrid method in the specific domain of causality tracking. SParse first uses the suspicious semantic transfer rule to construct a suspicious semantic graph. SParse then uses path-level contextual analysis to extract a streamlined critical component graph.

VIII Conclusion

We propose SParse, a system that processes streaming logs and outputs critical events (attack-related events) in response to alerts in real time. Specifically, SParse constructs a suspicious semantic graph related to the POI event using the suspicious semantic transfer rule and storage strategy. SParse then uses a suspicious flow path extraction algorithm to extract all reachable flow paths from the suspicious semantic graph. Finally, SParse uses path-level contextual analysis to score all paths and filters irrelevant events to obtain the final critical component graph. Our evaluation on real attacks demonstrates that SParse achieves low false positives (FP = 99), low overhead (30 MB of memory and 21.03 MB of hard disk), and low latency (1.58 s for attack investigation).

References

[1] “What twitter’s 200 million-user email leak actually means,” https://www.wired.com/story/twitter-leak-200-million-user-email-addresses/.
[2] “Mitre att&ck,” https://attack.mitre.org/.
[3] “System administration utilities,” https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/security_guide/chap-system_auditing.
[4] “About event tracing,” https://docs.microsoft.com/en-us/windows/win32/etw/about-event-tracing.
[5] A. Gehani and D. Tariq, “Spade: Support for provenance auditing in distributed environments,” in ACM/IFIP/USENIX Int. Middleware Conf., MIDDLEWARE. Springer, 2012, pp. 101–120.
[6] S. Ma et al., “Kernel-supported cost-effective audit logging for causality tracking,” in USENIX ATC, 2018, pp. 241–254.
[7] A. Bates et al., “Trustworthy whole-system provenance for the linux kernel,” in USENIX, 2015, pp. 319–334.
[8] M. A. Inam et al., “Sok: History is a vast early warning system: Auditing the provenance of system intrusions,” in S&P. IEEE, 2022, pp. 307–325.
[9] K. H. Lee et al., “High accuracy attack provenance via binary-based execution partition.” in NDSS, vol. 16, 2013.
[10] Y. Tang et al., “Nodemerge: Template based efficient data reduction for big-data causality analysis,” in CCS, 2018, pp. 1324–1337.
[11] Z. Xu et al., “High fidelity data reduction for big data security dependency analyses,” in CCS, 2016, pp. 504–516.
[12] S. M. Milajerdi et al., “Holmes: real-time apt detection through correlation of suspicious information flows,” in S&P. IEEE, 2019, pp. 1137–1152.
[13] W. U. Hassan et al., “Tactical provenance analysis for endpoint detection and response systems,” in S&P. IEEE, 2020, pp. 1172–1189.
[14] T. Zhu et al., “Aptshield: A stable, efficient and real-time apt detection system for linux hosts,” TDSC, 2023.
[15] W. U. Hassan et al., “Nodoze: Combatting threat alert fatigue with automated provenance triage,” in NDSS, 2019.
[16] W. U. Hassan et al., “This is why we can’t cache nice things: Lightning-fast threat hunting using suspicion-based hierarchical storage,” in ACSAC, 2020, pp. 165–178.
[17] Y. Liu et al., “Towards a timely causality analysis for enterprise security.” in NDSS, 2018.
[18] P. Fang et al., “Back-propagating system dependency impact for attack investigation,” in USENIX, 2022, pp. 2461–2478.
[19] A. Alsaheel et al., “Atlas: A sequence-based learning approach for attack investigation.” in USENIX, 2021, pp. 3005–3022.
[20] Z. Xu, P. Fang, C. Liu, X. Xiao, Y. Wen, and D. Meng, “Depcomm: Graph summarization on system audit logs for attack investigation,” in 2022 IEEE Symposium on Security and Privacy (SP). IEEE, 2022, pp. 540–557.
[21] J. Zeng et al., “Watson: Abstracting behaviors from audit logs via aggregation of contextual semantics.” in NDSS, 2021.
[22] W. U. Hassan et al., “Tactical provenance analysis for endpoint detection and response systems,” in S&P. IEEE, 2020, pp. 1172–1189.
[23] C. Xiong et al., “Conan: A practical real-time apt detection system with high accuracy and efficiency,” TDSC, vol. 19, no. 1, pp. 551–565, 2020.
[24] “Lateral movement,” https://www.crowdstrike.com/cybersecurity-101/lateral-movement/.
[25] “Apt notes,” https://github.com/aptnotes/data/.
[26] Y. Kwon et al., “Mci: Modeling-based causality inference in audit logging for attack investigation.” in NDSS, vol. 2, 2018, p. 4.
[27] P. Gao et al., “AIQL: Enabling efficient attack investigation from system monitoring data,” in USENIX, 2018, pp. 113–126.
[28] S. Ma et al., “Protracer: Towards practical provenance tracing by alternating between logging and tainting.” in NDSS, vol. 2, 2016, p. 4.
[29] S. T. King and P. M. Chen, “Backtracking intrusions,” in SOSP, 2003, pp. 223–236.
[30] “Darpa.” https://www.darpa.mil/program/transparent-computing.
[31] “Darpa transparent computing engagement 3,” 2023, https://drive.google.com/drive/folders/1QlbUFWAGq3Hpl8wVdzOdIoZLFxkII4EK.
[32] T. Zhu et al., “General, efficient, and real-time data compaction strategy for apt forensic analysis,” TIFS, vol. 16, pp. 3312–3325, 2021.
[33] P. Gao et al., “Enabling efficient cyber threat hunting with cyber threat intelligence,” in ICDE. IEEE, 2021, pp. 193–204.
[34] P. Gao et al., “SAQL: A stream-based query system for real-time abnormal system behavior detection,” in USENIX, 2018, pp. 639–656.
[35] D. Wagner and P. Soto, “Mimicry attacks on host-based intrusion detection systems,” in CCS, 2002, pp. 255–264.
[36] M. Bishop et al., Introduction to computer security. Addison-Wesley Boston, 2005, vol. 50.
[37] C. Kruegel et al., Intrusion detection and correlation: challenges and solutions. Springer Science & Business Media, 2004, vol. 14.
[38] “Insider threat monitoring software,” 2023, https://www.netwrix.com/insider_threat_detection.html.
[39] “Auditd,” 2023, https://linux.die.net/man/8/auditd.
[40] “Lttng,” 2023, https://lttng.org.
[41] “Sysdig,” 2023, https://github.com/draios/sysdig.
[42] “Redhat,” 2023, https://github.com/linux-audit/.
[43] W. U. Hassan et al., “This is why we can’t cache nice things: Lightning-fast threat hunting using suspicion-based hierarchical storage,” in ACSAC, 2020, pp. 165–178.
[44] Z. Xu et al., “High fidelity data reduction for big data security dependency analyses,” in CCS, 2016, pp. 504–516.
[45] “Exploit database,” Exploit Database, https://www.exploit-db.com/.
[46] “Cyber kill chain,” 2023, https://www.lockheedmartin.com/en-us/capabilities/cyber/cyber-kill-chain.html.
[47] K. Pei et al., “Hercule: Attack story reconstruction via community discovery on correlated log graph,” in ACSAC, 2016, pp. 583–595.
[48] M. N. Hossain et al., “Sleuth: Real-time attack scenario reconstruction from cots audit data.” in USENIX, 2017, pp. 487–504.
[49] T. Kim, X. Wang, N. Zeldovich, M. F. Kaashoek et al., “Intrusion recovery using selective re-execution.” in OSDI, 2010, pp. 89–104.
[50] M. N. Hossain et al., “Dependence-preserving data compaction for scalable forensic analysis,” in USENIX, 2018, pp. 1723–1740.
[51] N. Michael et al., “On the forensic validity of approximated audit logs,” in ACSAC, 2020, pp. 189–202.
[52] S. Wang et al., “Heterogeneous graph matching networks,” arXiv preprint arXiv:1910.08074, 2019.
[53] X. Han et al., “Unicorn: Runtime provenance-based detector for advanced persistent threats,” arXiv preprint arXiv:2001.01525, 2020.
[54] “Carbon black,” https://www.carbonblack.com/global-incident-response-threatreport/november-2018/.
[55] T. Chen, Q. Song, X. Qiu, T. Zhu, Z. Zhu, and M. Lv, “Kellect: a kernel-based efficient and lossless event log collector,” arXiv preprint arXiv:2207.11530, 2022.
[56] J. Byrnes et al., “A modern implementation of system call sequence based host-based intrusion detection systems,” in TPS-ISA. IEEE, 2020, pp. 218–225.
[57] S.-Y. Wang et al., “Design and implementation of an intrusion detection system by using extended bpf in the linux kernel,” JNCA, vol. 198, p. 103283, 2022.
[58] C. Zhong et al., “Automate cybersecurity data triage by leveraging human analysts’ cognitive process,” in HPSC. IEEE, 2016, pp. 357–363.
[59] V. D. Blondel et al., “Fast unfolding of communities in large networks,” Journal of statistical mechanics: theory and experiment, vol. 2008, no. 10, p. P10008, 2008.
[60] “Vpnfilter: New router malware with destructive capabilities,” 2018, https://symc.ly/2IPGGVE.
[61] “Ebay Inc. to ask Ebay users to change passwords,” 2014, http://blog.ebay.com/ebay-inc-ask-ebay-users-change-passwords/.
[62] “Schneier security: Router vulnerability the vpnfilter botnet,” 2018, https://www.schneier.com/blog/archives/2018/06/router_vulnerab.html.

Appendix 1

Appendix 1-A Attack Cases

In this section, we show the ground truth for the 10 attack cases used for evaluation in Section V-A2. As shown in Figure 6 and Figure 7, entities are drawn as rectangles ( $\langle ProcessName,\ ProcessID\rangle$ ) for processes, ellipses ( $\langle FileName\rangle$ ) for files, and diamonds ( $\langle SrcIP\ :\ SrcPort\rangle$ ) for sockets. Events are drawn as solid lines with arrows, where the arrow indicates the direction of information flow and the number on the line indicates the relative time of the event. The solid red line marks the POI event that triggered the alarm.
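The notation above maps naturally onto a small typed graph. The following is a minimal sketch, not the authors' implementation: the class and function names (`EntityKind`, `Entity`, `Event`, `poi_event`, `backtrack`) and all toy values are hypothetical and are not taken from Figure 6 or Figure 7. The `backtrack` function is a simplified backward traversal in the spirit of backtracking [29], included only to show how the relative-time labels constrain which events can have influenced the POI event.

```python
from dataclasses import dataclass
from enum import Enum


class EntityKind(Enum):
    PROCESS = "rectangle"  # labelled <ProcessName, ProcessID>
    FILE = "ellipse"       # labelled <FileName>
    SOCKET = "diamond"     # labelled <SrcIP : SrcPort>


@dataclass(frozen=True)
class Entity:
    kind: EntityKind
    label: str


@dataclass(frozen=True)
class Event:
    src: Entity           # information flows from src ...
    dst: Entity           # ... to dst (the arrow head)
    rel_time: int         # number on the edge: relative time of the event
    is_poi: bool = False  # True for the solid red edge that raised the alarm


def poi_event(events):
    """Return the single POI event of a case graph."""
    pois = [e for e in events if e.is_poi]
    if len(pois) != 1:
        raise ValueError(f"expected exactly one POI event, found {len(pois)}")
    return pois[0]


def backtrack(events, poi):
    """Collect events that could have influenced the POI, walking edges backward in time."""
    related = {poi}
    explored = {}  # entity -> latest time bound already explored
    stack = [(poi.src, poi.rel_time)]
    while stack:
        entity, bound = stack.pop()
        if explored.get(entity, float("-inf")) >= bound:
            continue  # already explored with an equal or looser bound
        explored[entity] = bound
        for e in events:
            # Only events that wrote into this entity no later than the bound matter.
            if e.dst == entity and e.rel_time <= bound:
                related.add(e)
                stack.append((e.src, e.rel_time))
    return sorted(related, key=lambda e: e.rel_time)


# Hypothetical toy case (values invented for illustration only).
sock = Entity(EntityKind.SOCKET, "10.0.0.5:80")
wget = Entity(EntityKind.PROCESS, "wget, 1001")
script = Entity(EntityKind.FILE, "script.py")
python = Entity(EntityKind.PROCESS, "python, 1002")
bash = Entity(EntityKind.PROCESS, "bash, 900")
notes = Entity(EntityKind.FILE, "notes.txt")
late = Entity(EntityKind.FILE, "late.txt")

events = [
    Event(sock, wget, 1),
    Event(wget, script, 2),
    Event(script, python, 3, is_poi=True),
    Event(bash, notes, 2),   # unrelated to the POI
    Event(late, wget, 5),    # happens after wget wrote script.py
]

result = backtrack(events, poi_event(events))
for e in result:
    print(e.rel_time, e.src.label, "->", e.dst.label)
```

In this toy graph the unrelated write to `notes.txt` and the read that happens after `wget` wrote the script are both excluded, because neither can have contributed to the POI event under the time constraint.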

Appendix 1-A1 Attacks Based on Commonly Used Exploits

These 7 attacks were used in the evaluations of previous works [28, 44, 18, 26] and consist of the following scenarios:

• Wget Executable [44]: Unsecured servers expose a vulnerability that allows unauthorized users to fetch executable Python scripts through wget and execute them, as shown in Figure 6(a).
• Illegal Storage [28]: Using wget, a server administrator retrieves suspicious files and stores them in a user’s home directory, as shown in Figure 6(b).
• Illegal Storage 2 [28]: Using curl, a server administrator retrieves suspicious files and stores them in a user’s home directory, as shown in Figure 6(c).
• Hide File [26]: To conceal a malicious file among a user’s normal files, the attacker downloads a script and obfuscates it by changing its filename and location, as shown in Figure 6(d).
• Steal Information [28]: The attacker steals a user’s sensitive data and stores it in a covert file to avoid detection, as shown in Figure 6(e).
• Backdoor Download [28]: A malicious insider establishes a connection to a rogue server using the ping command. The insider then downloads a concealed backdoor script and hides it, as shown in Figure 6(f).
• Annoying Server User [26]: A malicious user gains access to other users’ home directories and injects superfluous data into their files, as shown in Figure 6(g).

Appendix 1-A2 Multi-host Intrusive Attacks

In Attack 1, known as Shellshock Penetration, the attacker connects to cloud services (e.g., Dropbox, Twitter) after the initial exploit on Host 1 and downloads an image whose EXIF metadata encodes the C2 server’s IP address. This tactic, reminiscent of advanced persistent threat (APT) attacks [60, 61], aims to evade network-based detection systems that rely on DNS blacklisting. Using the obtained IP address, the attacker downloads malware from the C2 server to Host 1. When this script is executed, it examines the ssh configuration file to find reachable hosts in the network, including Host 2, Host 3, and Host 4. The malware then fetches another script from the C2 server and distributes it to the identified hosts, extracting passwords in the process, as shown in Figure 7(a).

In Attack 2, known as Data Leakage, the attacker, after reconnaissance, obtains another piece of malware, leak_data.sh, from the C2 server and distributes it to Host 2. This malware scans for concealed files and files containing sensitive strings and compresses them into a tarball named leak.tar.bz2. The tarball is then sent back to Host 1, where it is encrypted before being uploaded to the internet, as shown in Figure 7(b).

Attack 3, known as VPN Filter [60], focuses on sustaining a direct connection to victim hosts from the C2 server. The attacker employs the notorious VPN Filter malware [62] to establish an initial breach on Host 1 and discover Host 2. The attacker then downloads the VPN Filter stage 1 malware from the C2 server to Host 1 and transfers it to Host 2. This malware downloads another executable from the C2 server and executes it to launch the attack and establish a connection with the C2 server. Through this connection, the attacker transfers a malicious script to Host 2 that gathers sensitive data on the compromised host, as shown in Figure 7(c).