1.1 Motivation
Large-scale cloud systems log system events for several purposes, such as system modeling [11, 32], error diagnosis [19, 61, 82], user behavior profiling [21, 23, 50], and security attack detection [24, 63]. Large cloud providers can easily generate up to PBs of logs per day [54, 70, 78], and thus often choose to compress these logs to reduce storage cost; furthermore, they sometimes need to query these compressed logs for the purposes discussed above.
We studied the log access pattern in Alibaba Cloud, a major cloud provider and our collaborator. We observed that these logs can be categorized into three types: online logs are mainly used for monitoring system states and are queried frequently; near-line logs are mainly used for debugging and thus are queried only when a problem occurs; after a certain period (typically 6–12 months [67, 78]), logs are archived as offline logs.
The difference in their query patterns motivates different tradeoffs between storage cost and query latency. Online logs are queried frequently but do not need to be stored for a long time; they thus prefer methods that compromise storage cost for a lower query latency [8, 42]. Offline logs need to be stored for a long time but are rarely queried; they thus prefer methods that trade off query latency for a high compression ratio [13, 16, 39, 54, 55, 57, 71, 75, 78]. Near-line logs require (1) a high compression ratio like offline logs to reduce the storage cost, since they take up a large part of cloud log storage and need to be stored for a relatively long time, and (2) a low query latency, even though they are not queried as frequently as online logs: an engineer expects a query to finish as quickly as possible, since a delayed query reduces productivity during debugging. In interviews, engineers from Alibaba Cloud indicated that a query completion time of a few seconds is preferred and a completion time of less than one minute is acceptable, whereas queries that take several minutes or more cause dissatisfaction.
We have tested several existing works, including ElasticSearch (ES) [8] and CLP [67], and found that none achieves both goals. For example, it takes 14 minutes on average to execute a query using CLP, the state-of-the-art approach for querying compressed logs.
1.2 Contributions
This article aims at designing a compression and query method for near-line cloud logs with two objectives. First, it should minimize the overall cost including computation cost to compress logs, storage cost to store compressed logs, and computation cost to query logs. Second, it should limit query latency to an acceptable level.
To achieve both objectives, we adopt a classic idea from data processing systems called “vertical partitioning”. The idea is to break a single data entry into multiple fragments, compress similar fragments from different entries into a partition, and generate a summary for each partition to avoid decompressing irrelevant partitions when executing a query [15, 26, 51, 52, 73].
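As a minimal illustration of this idea, the following Python sketch (with hypothetical names; a real system compresses each partition and precomputes richer summaries) groups similar fragments into a partition, generates a simple summary (the character set and the maximum length), and consults the summary to decide whether a keyword query needs to scan the partition at all:

```python
# Minimal sketch of vertical partitioning with per-partition summaries.
# All names are hypothetical stand-ins, not an actual system's API.

class Partition:
    def __init__(self):
        self.values = []              # similar fragments from many log entries

    def add(self, fragment):
        self.values.append(fragment)

    def summary(self):
        # A simple summary: the character set and the maximum length.
        chars = set("".join(self.values))
        max_len = max(len(v) for v in self.values)
        return chars, max_len

def may_contain(partition, keyword):
    # Consult the summary first; only partitions whose summary is
    # compatible with the keyword are decompressed and scanned.
    chars, max_len = partition.summary()
    return set(keyword) <= chars and len(keyword) <= max_len

p = Partition()
for v in ("/tmp/1FF8a0b1.log", "/tmp/1FF8c2d3.log"):
    p.add(v)
print(may_contain(p, "ERROR"))   # False: 'E' and 'R' never occur here
print(may_contain(p, "1FF8"))    # True: this partition must be scanned
```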
There are two key challenges in realizing such an idea in the log storage field. The first is how to partition data entries during the compression process such that the content in each partition shares common features, which allows us to generate strict summaries that filter out as many irrelevant partitions as possible. The second is how to mitigate the read amplification incurred by the vertical-partitioning-based method: fragments of each log entry are stored in multiple partitions, and each partition contains fragments from different log entries. To reconstruct a log entry, we need to read and decompress all contents in the partitions containing the fragments of that entry, which incurs read amplification.
To solve the first challenge, we follow the three steps below.
Structurizing logs by exploiting static patterns. We first leverage existing log parsing methods to structurize log entries into templates and variables [30, 38, 39, 54, 60, 74, 75, 78, 87], because values of the same variable are more likely to share common features [54, 78]. For example, if an application has a log output statement printf("write to file:%s", filepath), log parsing methods can parse a corresponding log entry into the template “write to file:” and a variable “filepath”. Since the string template “write to file:” is specified by the developer, we call the template a static pattern in the rest of this article. After parsing log entries, we organize the values of the same variable (called a variable vector) into a partition. Compared to storing variable values following their original order in the logs [3, 67, 84], our method tends to store values that may share common features in the same partition, which benefits both compression and the generation of strict summaries.
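As a minimal illustration of this step, the following Python sketch structurizes log entries against the printf template above, assuming the static pattern is already known (the log parsers cited above infer such templates automatically):

```python
# Minimal sketch of structurization against a known static pattern;
# a real log parser infers the template rather than hard-coding it.
import re

TEMPLATE = re.compile(r"write to file:(?P<filepath>\S+)")

def build_variable_vector(log_entries):
    # Collect values of the "filepath" variable into one variable
    # vector, which is then stored and compressed as one partition.
    vector = []
    for entry in log_entries:
        match = TEMPLATE.match(entry)
        if match:
            vector.append(match.group("filepath"))
    return vector

logs = ["write to file:/tmp/1FF8a0b1.log",
        "write to file:/tmp/1FF8c2d3.log"]
print(build_variable_vector(logs))
# ['/tmp/1FF8a0b1.log', '/tmp/1FF8c2d3.log']
```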
In our experiments with a state-of-the-art log parser [78], this approach reduces query latency by about 5.72× and improves the compression ratio by about 2.01× compared with CLP. However, many queries still take more than one minute, which is unacceptable. We find the main reason is that summaries generated over whole variable vectors are often too general to enable efficient filtering.
Further improving the filtering efficiency by exploiting runtime patterns. To further improve the filtering efficiency, our key idea is to exploit runtime patterns within each variable vector. Unlike a static pattern specified by a programmer, a runtime pattern is generated by the application at run time. In our previous example, all values of “filepath” may follow a pattern “/tmp/1FF8<*>.log”, which is a runtime pattern.
After exploring a wide range of production logs, we find that runtime patterns are ubiquitous. These runtime patterns can help filter keywords, and they have a key feature: the variable part of the same runtime pattern (referred to as a sub-variable) often includes only a limited set of character types and has a similar length. For example, in our filepath runtime pattern “/tmp/1FF8< \(C_1\) >.log”, values of the sub-variable vector \(C_1\) all consist of 4 hexadecimal characters. By exploiting this key feature, we propose two optimizations. First, in addition to partitioning data into variable vectors, we further partition data into fine-grained sub-variable vectors and generate summaries on them. These summaries are stricter and allow more effective filtering. Second, we pad the values of each sub-variable vector to a fixed length to enable efficient keyword search and locating methods, with minimal impact on the compression ratio.
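The following minimal Python sketch illustrates the second optimization on the filepath example; the pattern constants, padding length, and helper names are ours for illustration, not LogGrep's actual interfaces:

```python
# Minimal sketch, assuming every value follows the runtime pattern
# "/tmp/1FF8<*>.log"; all constants and names are hypothetical.

PREFIX, SUFFIX = "/tmp/1FF8", ".log"
PAD_LEN, PAD = 4, "\0"   # sub-variable values are at most 4 hex chars

def to_sub_variable_vector(values):
    # Strip the constant parts and pad each sub-variable value to a
    # fixed length, producing one flat, fixed-stride buffer.
    subs = []
    for v in values:
        assert v.startswith(PREFIX) and v.endswith(SUFFIX)
        subs.append(v[len(PREFIX):-len(SUFFIX)].ljust(PAD_LEN, PAD))
    return "".join(subs)

def nth(buffer, i):
    # Fixed-length padding makes locating the i-th value O(1).
    return buffer[i * PAD_LEN:(i + 1) * PAD_LEN].rstrip(PAD)

buf = to_sub_variable_vector(["/tmp/1FF8a0b1.log", "/tmp/1FF8c2.log"])
print(nth(buf, 1))   # 'c2'
```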
Extracting runtime patterns automatically. Extracting runtime patterns automatically, however, is challenging. General-purpose pattern extraction algorithms [4, 59, 62, 86] are too slow given the scale of our logs. As a result, prior works extract log patterns by (1) analyzing the source/binary code [9, 12, 80, 85], which only works for static patterns; (2) first splitting logs into tokens with user-defined delimiters and then regarding constant tokens among logs as patterns [30, 39, 45, 56], which works well for static patterns but poorly for runtime patterns, since runtime patterns are more versatile and cannot be tokenized with user-defined delimiters; or (3) setting default patterns or asking the developer to manually provide patterns [67], which is certainly not ideal.
To address this challenge, we design a novel runtime pattern extraction method based on the following observation: variable vectors that do not include many duplicated values are usually dominated by a single runtime pattern. Following this observation, we first categorize variable vectors based on how many of their values are duplicated: we call variable vectors with a small percentage of duplicated values real variable vectors and variable vectors with many duplicated values nominal variable vectors. For real variable vectors, under the assumption that they include only one pattern, we design a tree expanding approach [44] to extract their patterns, which has \(O(n)\) time complexity (where \(n\) is the number of unique values in the sub-variable vector) and can extract finer runtime patterns. For nominal variable vectors, whose values have many duplicates, we only need to extract patterns on deduplicated values, so the complexity of the pattern extraction algorithm is less of a problem. We therefore design a pattern merging approach [39, 56], which has a time complexity of \(O(n\log{n})\) but can extract multiple patterns.
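The following minimal Python sketch shows this dispatch; the 10% duplicate threshold is a hypothetical value, and the two extractors are placeholders standing in for the actual tree expanding and pattern merging algorithms:

```python
# Minimal sketch of categorizing variable vectors and dispatching to
# the appropriate pattern extractor. Threshold and extractor bodies
# are hypothetical placeholders.
import os

DUP_THRESHOLD = 0.1   # hypothetical cut-off between real and nominal

def tree_expanding(unique_values):
    # Placeholder: report the common prefix as a single pattern.
    return [os.path.commonprefix(list(unique_values)) + "<*>"]

def pattern_merging(unique_values):
    # Placeholder: treat each deduplicated value as its own pattern.
    return sorted(unique_values)

def extract_patterns(variable_vector):
    unique = set(variable_vector)
    dup_ratio = 1 - len(unique) / len(variable_vector)
    if dup_ratio < DUP_THRESHOLD:
        # Real variable vector: assume one dominating runtime pattern
        # and extract it in O(n) over the unique values.
        return tree_expanding(unique)
    # Nominal variable vector: few unique values, so an O(n log n)
    # pass that can recover multiple patterns is affordable.
    return pattern_merging(unique)

print(extract_patterns(["/tmp/1FF8a0b1.log", "/tmp/1FF8c2d3.log"]))
# ['/tmp/1FF8<*>']
```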
To solve the second challenge, we follow the two steps below.
Observations on interactive debugging sessions. We observe that when an error occurs, engineers usually query the logs interactively and incrementally: they first query the logs with coarse-grained keywords, like “ERROR”; then they browse the temporal distribution of hit entries and check only a small number of them to come up with more fine-grained queries, like “ERROR not 404”, to narrow down the search space; they repeat this procedure until they locate the root cause (see a detailed example in Section 2.4). We call such a procedure interactive debugging and call multiple successive incremental queries from one engineer a session. Based on an analysis of 70,406,619 real-world debugging sessions in Alibaba Cloud, we find that 87.79% of them are interactive debugging sessions. Such interactive debugging can well mitigate the read amplification of the vertical-partitioning-based method: a later query only refines the locating result of the prior one, and an engineer reads only a limited number of entries, so it is not necessary to reconstruct all hit entries.
Incremental locating and partial reconstruction. Based on this observation, we introduce an incremental locating technique to locate the positions of hit entries among different partitions incrementally. The technique is built on a structure called the Indexed Bitmap, which records and maintains previous locating results so that each subsequent locating step only checks and prunes the previous hit entries. The Indexed Bitmap enables checking and pruning with \(O(1)\) complexity by utilizing an index array and fixed-length padding. We also introduce partial reconstruction, which returns only the locating results and a limited number of reconstructed entries. This is sufficient for interactive debugging and significantly mitigates the read amplification incurred by the vertical-partitioning-based method.
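The following minimal Python sketch conveys the idea of incremental locating within a session; the set of hit positions is a simplification of the actual Indexed Bitmap, which adds a bitmap and an index array over fixed-length values to make each check \(O(1)\):

```python
# Minimal sketch of incremental locating over a debugging session.
# The "bitmap" is reduced to a plain set of hit positions.

def first_query(entries, keyword):
    # The first query of a session scans everything (via summaries
    # and compressed partitions in the real system).
    return {i for i, e in enumerate(entries) if keyword in e}

def refine(entries, hits, keyword, negate=False):
    # Later queries only check and prune the previous hit entries,
    # so no full rescan or full reconstruction is needed.
    return {i for i in hits if (keyword in entries[i]) != negate}

entries = ["ERROR 404 on /a", "ERROR 500 on /b", "INFO ok"]
hits = first_query(entries, "ERROR")              # {0, 1}
hits = refine(entries, hits, "404", negate=True)  # {1}: "ERROR not 404"
print(sorted(hits))
```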
Based on these ideas, we have designed and implemented LogGrep, a tool that compresses logs with a high compression ratio and supports Linux grep-like commands on the compressed logs. On 21 Alibaba Cloud production logs and 16 public logs, we compare LogGrep with CLP [67], the state-of-the-art method to compress and execute text queries on logs; ES [8], a method focusing more on query latency; and gzip+grep, the current method used by Alibaba Cloud. Our evaluation shows that, first, LogGrep can usually complete a query within a minute, which is an order of magnitude faster than CLP and gzip+grep and comparable to ES. Second, considering the storage and computation costs in Alibaba Cloud, the overall cost of LogGrep is 36% of that of CLP, 7% of that of ES, and 34% of that of gzip+grep.
Our contributions can be briefly summarized as the five points below.
— We propose LogGrep, the first vertical-partitioning-based log compression and query tool that structurizes log data in fine-grained units. We demonstrate that a proper structurization method enables simple but effective summaries to accelerate queries on compressed log data.
— To the best of our knowledge, LogGrep is the first tool to extract runtime patterns automatically to improve data filtering efficiency. To achieve this, we propose a novel runtime pattern extraction method that separates real and nominal variable vectors.
— We observe that the pattern of interactive debugging can well mitigate the weakness of vertical-partitioning-based log storage. We introduce incremental locating and partial reconstruction techniques to mitigate read amplification during the locating and reconstruction procedures, respectively.
— We evaluated LogGrep on 21 real-world production logs [5] and found that it achieves a significant query latency reduction and considerable cost savings over state-of-the-art systems.
— We make LogGrep open source [6].