5.3 Simplification-based Reduction
To reduce the sheer size of logs, a garbage collector for audit logs, namely LogGC, was proposed in [65]. A modified version of the classic reachability-based memory garbage collection algorithm removes redundant and unreachable nodes. LogGC optimizes the garbage collection process by partitioning long-running processes into units and files into logical data units and can thus reduce the audit log size for forensic analysis by 14 times for regular applications and 37 times for server applications compared to BEEP [64].
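The core of LogGC can be sketched as a reachability pass over the provenance graph. The snippet below is a minimal illustration, assuming a networkx DiGraph and a simplified liveness criterion (a set of forensic roots, such as live processes or externally visible sinks); the function name and the criterion are ours, not LogGC's.

```python
# A minimal sketch of reachability-based garbage collection on a
# provenance graph, in the spirit of LogGC. The liveness criterion
# (forensic roots) is a simplification of LogGC's actual rules.
import networkx as nx

def collect_garbage(graph: nx.DiGraph, roots: set) -> nx.DiGraph:
    """Keep only nodes with an information flow to or from a root."""
    live = set(roots)
    for root in roots:
        live |= nx.ancestors(graph, root)    # events leading to the root
        live |= nx.descendants(graph, root)  # effects of the root
    return graph.subgraph(live).copy()

# Example: a temporary file written by unit U1 has no path to a
# forensic root, so it is collected.
g = nx.DiGraph()
g.add_edges_from([("U1", "F_tmp"), ("U2", "socket"), ("lib.so", "U2")])
reduced = collect_garbage(g, roots={"socket"})
print(sorted(reduced.nodes))  # ['U2', 'lib.so', 'socket'] -- F_tmp removed
```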
Another simplification-based reduction approach is ProTracer [75], which reduces the sheer amount of provenance data by alternating between logging and provenance propagation. It is based on the observation that processes often only read files but write neither to permanent storage nor to the external environment, such as sending data over the network; such traces can be removed from the provenance data. ProTracer first partitions long-running processes into units and taints units that conduct read operations with the source they have read. The provenance data of these units is only logged when the units conduct any write operations before they terminate. Second, ProTracer avoids logging dead events that do not permanently affect the system by tainting units that conduct internal write operations. For example, if no other unit accesses the created files during its lifecycle, the files are temporary files and are therefore not logged. Third, ProTracer avoids redundantly logging units by tainting units that behave the same as already logged units. As long as their behavior does not change, the provenance data of these units is not logged. Hence, ProTracer can reduce the space overhead by a minimum of 96% for log entries and 98% for disk space compared to BEEP [64].
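The alternation between taint propagation and logging can be sketched as follows; the Unit class and its callbacks are hypothetical simplifications of ProTracer's unit-level instrumentation.

```python
# A minimal sketch of ProTracer-style alternation: read events only
# taint a unit, and provenance is emitted lazily, only if the unit
# writes before it terminates. All names are illustrative.
class Unit:
    """One execution unit of a long-running process (as produced by BEEP)."""
    def __init__(self, unit_id):
        self.unit_id = unit_id
        self.taints = set()          # provenance sources read so far

    def on_read(self, source):
        self.taints.add(source)      # propagate taint only -- no log entry

    def on_write(self, target, log):
        # A write makes the unit forensically relevant: emit its provenance.
        for source in sorted(self.taints):
            log.append((self.unit_id, source, target))

    def on_exit(self):
        self.taints.clear()          # read-only unit: nothing is ever logged

log = []
u = Unit("proc-42/unit-1")
u.on_read("/etc/nginx/nginx.conf")
u.on_exit()                          # no write happened -> log stays empty
print(log)                           # []
```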
Previous graph summarization approaches [65, 75] were applied after the logs had already been collected, transferred, and temporarily stored. These steps already produce significant runtime overhead and temporary space overhead. To address this issue, a cache-based in-kernel online graph summarization system, namely Kernel-supported Cost-effective Audit Logging (KCAL), was proposed [73]. KCAL is a modification of the Linux Audit System and applies BEEP [64] to split long-running processes into units. First, in-unit redundancies, i.e., a unit performing the same operations on the same object, and cross-unit redundancies, i.e., different units performing the same operations, are detected. Second, temporary files, i.e., files that are created, used, and deleted by the same process, are identified. As a result, KCAL can reduce the runtime overhead of the Linux Audit System from 40% to 15%, and the space overhead by 90% on average.
5.5 Edge-grouping-based Reduction
LogGC [65], ProTracer [75], and KCAL [73] require the presence of unit instrumentation to be effective. The drawbacks of unit instrumentation are that the source code needs to be accessible and that the instrumentation itself adds significant runtime overhead. To address this issue, two edge-grouping methods, coined Causality-Preserving Reduction (CPR) and Process-centric Causality Approximation Reduction (PCAR), were proposed in [112].
CPR is based on the observation that only a small number of key events show causal importance to other events. Thus, irrelevant events can be removed, and shadowed events can be aggregated with their key event. Figure 6 shows an example graph in which process A is the Point of Interest (POI) for forward tracing in the forensic analysis. The graph shows that, first, event E5 is a shadow event of event E2, and thus semantic information such as the timestamp can be aggregated. Second, event E3 is an irrelevant event that can be removed because it does not affect the result of forward tracing in a forensic analysis [112].
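A minimal sketch of this aggregation over a time-ordered event list is shown below; the shadow test used here (the source is unchanged since the key event) is a simplification of the full trackability-equivalence check in [112].

```python
# A minimal sketch of CPR-style aggregation: an event is a shadow of an
# earlier key event if it connects the same two entities and no
# interleaved event changed the source in between.
def cpr_reduce(events):
    """events: time-ordered (ts, src, dst) tuples; returns aggregated edges."""
    last_update = {}   # entity -> ts of the last event that flowed into it
    key_edge = {}      # (src, dst) -> aggregated edge [src, dst, [ts, ...]]
    reduced = []
    for ts, src, dst in events:
        edge = key_edge.get((src, dst))
        # Shadow event: same endpoints, and the source is unchanged since
        # the key event, so no new reachability information is added.
        if edge and last_update.get(src, -1) < edge[2][0]:
            edge[2].append(ts)       # aggregate timestamp into key event
        else:
            edge = [src, dst, [ts]]
            key_edge[(src, dst)] = edge
            reduced.append(edge)
        last_update[dst] = ts
    return reduced

events = [(1, "F1", "A"), (2, "A", "F2"), (3, "A", "F3"), (5, "A", "F2")]
print(cpr_reduce(events))
# [['F1', 'A', [1]], ['A', 'F2', [2, 5]], ['A', 'F3', [3]]]
```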
PCAR is based on the observation that some processes produce an intense burst of events, such as scanning for files or devices, which are semantically similar but cannot be reduced by CPR due to their interleaved causalities. PCAR detects such burst events, creates a neighbor set around the burst event, and checks traceability only for information flows into and out of the neighbor set. With this approach, approximately shadowed events within the neighbor set can be detected and aggregated. In Figure 7, process C is a burst event, the dotted circle shows its neighbor set, and event E3 is an approximately shadowed event. It can be aggregated with event E2 even though it has interleaved causalities. Events E5 and E6, however, cannot be aggregated, as their interleaved event E7 is an information flow going outside the neighbor set. As a result, CPR can reduce the space overhead by 56%, and in combination with PCAR by 70% [112]. While the combination of CPR and PCAR [112] reduces the space overhead by a factor of 1.8, it does not consider the global context of events.
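PCAR's burst-detection step can be sketched with a sliding window over per-process event timestamps; the window size, threshold, and neighbor-set construction below are illustrative, not the values or rules used in [112].

```python
# A minimal sketch of burst detection: a process whose event rate within
# a sliding window exceeds a threshold is marked as a burst source, and
# its touched objects form the neighbor set within which approximately
# shadowed events may be aggregated.
from collections import defaultdict

def find_bursts(events, window=1.0, threshold=100):
    """events: time-ordered (ts, proc, obj) tuples; returns proc -> neighbors."""
    times = defaultdict(list)
    for ts, proc, _ in events:
        times[proc].append(ts)
    bursts = {}
    for proc, ts_list in times.items():
        lo = 0
        for hi, ts in enumerate(ts_list):
            while ts - ts_list[lo] > window:
                lo += 1
            if hi - lo + 1 >= threshold:        # burst: >= threshold events
                bursts[proc] = {o for t, p, o in events if p == proc}
                break
    return bursts

evts = [(i * 0.001, "scanner", f"/dev/sd{i}") for i in range(200)]
print(list(find_bursts(evts)))   # ['scanner']
```

Events whose interleaved causalities stay inside the neighbor set (like E3 in Figure 7) can then be approximately aggregated, while flows crossing its boundary (like E7) block aggregation.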
Continuous Dependence (CD), Full Dependence (FD), and Source Dependence (SD) preservation can further improve the space overhead reduction rate by considering the global context [51].
CD preservation works similarly to CPR and PCAR [112], but also aggregates duplicate events by their global reachability properties, considering the context of the event itself rather than checking local interleaving causalities. By applying CD preservation to the graph in Figure 8, event E1 can be aggregated with event E2 even though there is the interleaving causality of event E3 [51].
In Figure 9, the previous graph has been extended by process D; thus, CD preservation can no longer aggregate events E3 and E4. Nevertheless, aggregating those events would not affect the forward and backward tracing in a forensic analysis. Therefore, FD preservation aggregates events by checking whether the resulting reduced graph would generate the same output for forward and backward tracing as the original graph, so events E3 and E4 can be aggregated again. On average, FD preservation can reduce the space overhead by a factor of 7 [51].
To further reduce the space overhead, SD preservation removes events that do not affect the forward and backward tracing in a forensic analysis. Given the example graph in Figure 10, events E5 and E6 can be removed because forward tracing from node A to E and backward tracing from node E to A on the reduced graph still results in the same set of nodes as applying it to the original graph. SD achieves a reduction factor of 9.2 [51].
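The global check underlying CD, FD, and SD preservation can be sketched as a reachability-equivalence test between the original and the reduced graph; the sketch ignores timestamps and is therefore only an approximation of the time-respecting reachability used in [51].

```python
# A minimal sketch of the dependence-preservation test: a reduction is
# admissible only if every node still reaches the same nodes under
# forward and backward tracing. Exhaustively re-checking reachability
# like this is exactly the expensive global computation that [51]
# later optimizes with versioned graphs.
import networkx as nx

def tracing_equivalent(original: nx.DiGraph, reduced: nx.DiGraph) -> bool:
    for node in original.nodes:
        if nx.descendants(original, node) != nx.descendants(reduced, node):
            return False  # forward tracing from `node` differs
        if nx.ancestors(original, node) != nx.ancestors(reduced, node):
            return False  # backward tracing from `node` differs
    return True

g = nx.DiGraph([("A", "B"), ("B", "E"), ("A", "C"), ("C", "B")])
candidate = g.copy()
candidate.remove_edge("C", "B")   # drop a redundant-looking edge
print(tracing_equivalent(g, candidate))  # False: C no longer reaches B or E
```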
One drawback of CD, FD, and SD is precisely this dependence on global graph properties: computing them on a timestamped graph is expensive, mainly because reachability changes over time. The authors therefore proposed converting the timestamped graph into a naive versioned graph and then applying different optimization techniques to reduce the number of edges and versions.
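A naive conversion can be sketched as follows: every event that flows into a node creates a new version of that node, and consecutive versions are linked, so time-respecting reachability becomes plain static reachability. The version-merging optimizations of [51] are omitted.

```python
# A minimal sketch of converting a timestamped edge list into a naive
# versioned graph.
def versioned_graph(events):
    """events: time-ordered (ts, src, dst); returns static versioned edges."""
    version = {}          # entity -> current version number
    edges = []
    for ts, src, dst in events:
        src_v = (src, version.get(src, 0))
        version[dst] = version.get(dst, 0) + 1   # new inflow -> new version
        dst_v = (dst, version[dst])
        edges.append((src_v, dst_v))
        # Link the new version to its predecessor so earlier flows into
        # the node remain reachable from its later versions.
        if version[dst] > 1:
            edges.append(((dst, version[dst] - 1), dst_v))
    return edges

evts = [(1, "F1", "A"), (2, "A", "F2"), (3, "F3", "A"), (4, "A", "F2")]
for e in versioned_graph(evts):
    print(e)
# F1 reaches both versions of F2, while F3 (arriving at ts 3)
# reaches only ('F2', 2) -- matching the time-respecting semantics.
```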
GrAALF [93] is a system for forensic analysis that collects data from heterogeneous sources, stores the data in one or multiple of the provided backend storage solutions, and enables real-time forward and backward tracing using its proposed query language. To store the provenance data in multiple backend storage solutions efficiently, three graph summarization methods were proposed (see Figure 11).
Lossless Compression (C1) aggregates the edge properties of causalities with the same subject node, object node, and edge type. C2 provides the same accuracy as C1 but keeps only the first and last occurrence of the edge properties. Lossy Compression (C3) works like C1 but keeps only the first occurrence of the edge properties. However, no evaluation of these graph summarization methods is provided, so their effectiveness cannot be compared with other approaches.
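The three modes can be sketched as different retention policies over grouped edge properties; the record layout below is an assumption for illustration, not GrAALF's storage schema.

```python
# A minimal sketch of the three GrAALF compression modes over edge
# records (src, dst, etype, prop), grouped by (src, dst, etype).
def compress(edges, mode="C1"):
    groups = {}
    for src, dst, etype, prop in edges:
        groups.setdefault((src, dst, etype), []).append(prop)
    out = []
    for key, props in groups.items():
        if mode == "C1":          # lossless: keep all properties
            kept = props
        elif mode == "C2":        # keep first and last occurrence only
            kept = props[:1] + (props[-1:] if len(props) > 1 else [])
        else:                     # C3, lossy: keep first occurrence only
            kept = props[:1]
        out.append((*key, kept))
    return out

edges = [("A", "F", "write", t) for t in (1, 2, 5, 9)]
print(compress(edges, "C2"))  # [('A', 'F', 'write', [1, 9])]
```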
5.6 Node-grouping-based Reduction
Another approach to reducing the space overhead is NodeMerge [99], which is based on the observation that processes produce many redundant events during their initialization, such as loading libraries, accessing read-only resources, or retrieving configurations. NodeMerge detects and summarizes such event patterns by, first, creating Frequent Access Patterns (FAPs); second, automatically learning templates from the FAPs based on an optimized Frequent Pattern (FP)-Growth algorithm; and third, using those templates to compress further event data. The template-based approach can thus reduce the space overhead by 75 times for raw data and by 32 times compared to previous approaches such as LogGC [65] or CPR/PCAR [112]. The approach is particularly efficient for hosts that repeatedly run the same processes, but it may be less efficient for hosts that mainly execute write-intensive processes. Figure 12 shows an example graph in which process B reads files D to F during each initialization, which NodeMerge detects and summarizes as template T1 to reduce the space overhead.
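The pipeline can be sketched as follows; a plain frequency count over initialization read sets stands in for the optimized FP-Growth mining, so the sketch conveys only the template idea.

```python
# A minimal sketch of NodeMerge-style template compression: learn
# frequently recurring read sets, then replace each occurrence with a
# compact template identifier.
from collections import Counter

def learn_templates(read_sets, min_support=2):
    counts = Counter(frozenset(s) for s in read_sets)
    return {s: f"T{i + 1}" for i, (s, c) in enumerate(counts.items())
            if c >= min_support}

def compress_events(read_sets, templates):
    # Replace each known pattern by its template id; keep the rest as-is.
    return [templates.get(frozenset(s), s) for s in read_sets]

runs = [{"/lib/a.so", "/lib/b.so", "/etc/conf"}] * 3 + [{"/data/x"}]
templates = learn_templates(runs)
print(compress_events(runs, templates))   # ['T1', 'T1', 'T1', {'/data/x'}]
```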
Winnower is the first graph summarization approach that offers scalability for clusters. Replicated microservices in a cluster generate structurally and semantically similar provenance graphs [48]. In Winnower, first, deterministic and node-specific information is removed to create an abstract provenance graph on each worker node. Second, on each worker node, the abstract provenance graph is converted into a behavior model by using Deterministic Finite Automata (DFA) learning to generate a graph grammar. Third, the behavior models are aggregated into a unified model on a master node, which is sent back to all worker nodes. Additionally, the unified model adds a confidence level to each node in the graph to reflect the consensus across the worker nodes. For example, a subgraph of the unified model with a low confidence level indicates that this behavior occurred only on one or a few worker nodes and could thus represent anomalous activity. Lastly, new provenance data on each worker node is checked against the unified model; if the model does not already cover the data, the behavior model is updated and sent to the master node for aggregation. The graph in Figure 13 shows the resulting provenance graph when using Winnower for monitoring cluster-wide behavior. For nodes A to C, the confidence level is high, which implies that the worker nodes generate homogeneous behavior. For nodes D and E, however, the confidence level is low, which indicates that the behavior was generated by a single or only a few nodes; it could reflect malicious behavior and has to be analyzed further. Winnower achieves a space overhead reduction of 98% while maintaining the important information required for attack investigation.
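Two of these steps, abstracting away instance-specific details and scoring cross-worker consensus, can be sketched as below; the DFA-based graph-grammar learning is omitted, and the digit-stripping abstraction is a stand-in for Winnower's actual abstraction rules.

```python
# A minimal sketch of Winnower-style abstraction and confidence scoring:
# replicated containers produce identical graphs once instance-specific
# tokens (PIDs, ports, container IDs) are wildcarded, and rare edges
# receive a low cross-worker confidence.
import re
from collections import Counter

def abstract(edge):
    """Replace instance-specific numeric tokens with wildcards."""
    src, dst = edge
    return (re.sub(r"\d+", "*", src), re.sub(r"\d+", "*", dst))

def confidence(worker_graphs):
    """Fraction of workers whose abstracted graph contains each edge."""
    counts = Counter()
    for g in worker_graphs:
        counts.update({abstract(e) for e in g})
    n = len(worker_graphs)
    return {edge: c / n for edge, c in counts.items()}

workers = [
    {("nginx[101]", "/var/log/access"), ("nginx[101]", "socket:80")},
    {("nginx[202]", "/var/log/access"), ("nginx[202]", "socket:80")},
    {("nginx[303]", "/var/log/access"), ("nginx[303]", "/etc/shadow")},
]
for edge, conf in confidence(workers).items():
    print(edge, round(conf, 2))   # the /etc/shadow edge scores only 0.33
```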
LogApprox is an attack-preserving graph summarization approach based on the observation that most of the storage of provenance data (88.97%) is occupied by I/O events. LogApprox generates regular expressions that describe benign I/O events and then uses these regular expressions to summarize the provenance graph. Figure 14 shows an example graph in which process B writes to multiple files. LogApprox detects these I/O events, creates a regular expression, and uses it to summarize the I/O events. The authors evaluated LogApprox against previous approaches such as LogGC [65], CPR [112], FD, and SD [51] using their proposed metrics. The results show that only LogApprox and CPR achieve the highest forensic validity for attack preservation, with LogApprox further achieving a higher data reduction rate than CPR. FD and SD achieve the highest data reduction rates but also the lowest forensic validity for attack preservation.
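The summarization idea behind the Figure 14 example can be sketched with a common-prefix heuristic in place of LogApprox's actual regex-learning algorithm; the helper below is illustrative only.

```python
# A minimal sketch of LogApprox-style I/O summarization: infer a regular
# expression over the file paths a process touches and collapse all
# matching file nodes into one summary node.
import os
import re

def summarize_io(paths):
    prefix = os.path.commonprefix(paths)
    pattern = re.escape(prefix) + r"[^/]*$"
    regex = re.compile(pattern)
    assert all(regex.match(p) for p in paths)  # sanity: pattern covers all
    return pattern   # one summary node replaces len(paths) file nodes

paths = [f"/var/www/cache/page{i}.html" for i in range(1000)]
print(summarize_io(paths))   # /var/www/cache/page[^/]*$  (1000 nodes -> 1)
```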