1 Introduction
Time series are relevant in several contexts, and the Internet of Things (IoT) ecosystem is among the most pervasive ones. IoT devices, indeed, can be found in different applications, ranging from health care (smart wearables) to industrial ones (smart grids) [1], producing a large amount of time-series data. For instance, a single Boeing 787 flight can produce about half a terabyte of sensor data [2]. In those scenarios, characterized by high data rates and volumes, time series compression techniques are a sensible choice to increase the efficiency of collection, storage, and analysis of such data. In particular, the need to include in the analysis both the recent behavior and the full history of the data stream leads to considering data compression as a solution to optimize space without losing the most important information. A direct application of time series compression, for example, can be seen in Time Series Management Systems (or Time Series Databases), in which compression is one of the most significant steps [3].
There exists an extensive literature on data compression algorithms, both general-purpose ones for finite-size data and domain-specific ones, for example, for images, video, and audio data streams. This survey aims at providing an overview of the state-of-the-art in time series compression research, specifically focusing on general-purpose data compression techniques that are either developed for time series or work well with them.
The algorithms we chose to summarize can deal with the continuous growth of time series over time and are suitable for generic domains (as in the different applications in the IoT). Furthermore, these algorithms take advantage of the peculiarities of time series produced by sensors, such as
– Redundancy: some segments of a time series can frequently appear inside the same or other related time series;
– Approximability: sensors in some cases produce time series that can be approximated by functions;
– Predictability: some time series can be predictable, for example using deep neural network techniques.
The main contribution of this survey is to present a reasoned summary of the state-of-the-art in time series compression algorithms, which are currently fragmented among several sub-domains ranging from databases to IoT sensor management. Moreover, we propose a taxonomy of time series compression techniques based on their approach (dictionary-based, functional approximation, autoencoders, sequential, others) and their properties (adaptiveness, lossless reconstruction, symmetry, tuneability of max error or minimum compression ratio), anticipated in visual form in Figure 1 and discussed in Section 3, that will guide the description of the selected approaches. Finally, we recapitulate the results of performance measurements indicated in the described studies.
1.1 Outline of the Survey
Section 2 describes the method we applied to select the literature included in this survey. Section 3 provides some definitions regarding time series, compression, and quality indexes. Section 4 describes the compression algorithms and is structured according to the proposed taxonomy. Section 5 summarizes the experimental results found in the studies that originally presented the approaches we describe. A summary and conclusions are presented in Section 6.
3 Background
3.1 Time Series
A time series is defined as a collection of data elements, sorted in ascending order according to the timestamp \(t_i\) associated with each element. Time series are divided into
– Univariate Time Series (UTS): elements inside the collection are real values;
– Multivariate Time Series (MTS): elements inside the collection are arrays of real values, in which each position in the array is associated with a time series feature.
For instance, the temporal evolution of the average daily price of a commodity, like the one represented in the plot in Figure 2, can be modeled as a UTS, whereas the summaries of daily exchanges for a stock (including opening price, closing price, the volume of trades, and other information) can be modeled as an MTS.
Using a formal notation, time series can be written as
\(TS = [(t_1, x_1), (t_2, x_2), \dots , (t_n, x_n)], \quad x_i \in \mathbb {R}^m,\)
where \(n\) is the number of elements inside a time series and \(m\) is the vector dimension of an MTS; for UTS, \(m = 1\). Given \(k\in [1, n]\), we write \(TS[k]\) to indicate the \(k\)th element \((t_k, x_k)\) of the time series \(TS\).
A time series can be divided into segments, defined as portions of the time series without any missing elements and with ordering preserved:
\(TS_{[i,j]} = [(t_i, x_i), \dots , (t_j, x_j)], \quad 1 \le i \le j \le n,\)
where \(\forall k \in [i, j], TS[k] = TS_{[i,j]}[k-i+1]\).
3.2 Compression
Data compression, also known as source coding, is defined in [4] as “the process of converting an input data stream (the source stream or the original raw data) into another data stream (the output, the bitstream, or the compressed stream) that has a smaller size”. This process can take advantage of the Simplicity Power (SP) theory, formulated in [5], in which the compression goal is to remove redundancy while retaining high descriptive power.
The decompression process, complementary to the compression one, is also known as source decoding, and tries to reconstruct the original data stream from its compressed representation.
Compression algorithms can be described by a combination of the following classes:
– Non-adaptive—adaptive: a non-adaptive algorithm does not need a training phase to work efficiently with a particular dataset or domain, since its operations and parameters are fixed, while an adaptive one does;
– Lossy—lossless: an algorithm is lossy if the decoder does not return a result that is identical to the original data, or lossless if the decoder result is identical to the original data;
– Symmetric—non-symmetric: an algorithm is symmetric if the decoder performs the same operations of the encoder in reverse order, whereas a non-symmetric one uses different operations to encode and decode a time series.
In the particular case of time series compression, a compression algorithm (encoder) takes in input one time series \(TS\) of size \(s\) and returns its compressed representation \(TS^{\prime }\) of size \(s^{\prime }\), where \(s^{\prime } \lt s\) and the size is defined as the number of bits needed to store the time series: \(E(TS) = TS^{\prime }\). From the compressed representation \(TS^{\prime }\), using a decoder, it is possible to reconstruct the original time series: \(D(TS^{\prime }) = \overline{TS}\). If \(\overline{TS} = TS\), the algorithm is lossless; otherwise, it is lossy.
Section 4 presents the most relevant categories of compression techniques and their implementations.
3.3 Quality Indices
To measure the performance of a compression encoder for time series, three characteristics are considered: compression ratio, speed, and accuracy.
Compression ratio. This metric measures the effectiveness of a compression technique, and it is defined as
\(\rho = \frac{s^{\prime }}{s},\)
where \(s^{\prime }\) is the size of the compressed representation and \(s\) is the size of the original time series. Its inverse \(\frac{1}{\rho }\) is named compression factor. An index used for the same purpose is the compression gain, defined as
\(c_g = 100\,\log _e \frac{1}{\rho }.\)
Accuracy, also called distortion, measures the fidelity of the reconstructed time series with respect to the original. It is possible to use different metrics to determine fidelity [6]:
– Mean Squared Error: \(MSE = \frac{\sum _{i = 1}^{n}{(x_i - \overline{x}_i)^2}}{n}\).
– Root Mean Squared Error: \(RMSE = \sqrt {MSE}\).
– Signal to Noise Ratio: \(SNR = \frac{\sum _{i = 1}^n{x_{i}^{2}/n}}{MSE}\).
– Peak Signal to Noise Ratio: \(PSNR = \frac{x_{peak}^2}{MSE}\), where \(x_{peak}\) is the maximum value in the original time series.
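To make these indices concrete, the following sketch (ours, not from the surveyed papers; the function name and arguments are illustrative) computes all of them with NumPy for a reconstructed series:

```python
import numpy as np

def quality_indices(x, x_rec, s, s_comp):
    """Quality indices of Section 3.3; sizes s and s_comp are in bits."""
    x, x_rec = np.asarray(x, float), np.asarray(x_rec, float)
    mse = np.mean((x - x_rec) ** 2)
    return {
        "compression ratio": s_comp / s,   # rho = s'/s
        "compression factor": s / s_comp,  # 1/rho
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "SNR": np.mean(x ** 2) / mse,      # mean signal power over MSE
        "PSNR": np.max(x) ** 2 / mse,      # peak value squared over MSE
    }
```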
4 Compression Algorithms
In this section, we present the most relevant time series compression algorithms by describing in short summaries their principles and peculiarities. We also provide a pseudo-code for each approach, focusing more on style homogeneity than on the faithful reproduction of the original pseudo-code proposed by the authors. For a more detailed description, we refer the reader to the complete details available in the original articles. Below is the full list of algorithms described in this section, divided by approach:
(1) Dictionary-Based (DB):
1.1 TRISTAN;
1.2 CORAD;
1.3 Accelerometer LZSS (A-LZSS);
1.4 Differential LZW (D-LZW).
(2) Functional Approximation (FA):
2.1 Piecewise Polynomial Approximation (PPA);
2.2 Chebyshev Polynomial Transform (CPT);
2.3 Discrete Wavelet Transform (DWT);
2.4 Discrete Fourier Transform (DFT);
2.5 Discrete Cosine Transform (DCT).
(3) Autoencoders:
3.1 Recurrent Neural Network Autoencoder (RNNA);
3.2 Recurrent Convolutional Autoencoder (RCA);
3.3 DZip.
(4) Sequential Algorithms (SA):
4.1 Delta encoding, Run-length, and Huffman (DRH);
4.2 Sprintz;
4.3 Run-Length Binary Encoding (RLBE);
4.4 RAKE.
(5) Others:
5.1 Major Extrema Extractor (MEE);
5.2 Segment Merging (SM);
5.3 Continuous Hidden Markov Chain (CHMC).
The RNNA algorithm can be applied both to univariate and multivariate time series. The other algorithms can also handle MTS by extracting each feature as an independent time series. All the algorithms accept time series represented with real values.
The methods we considered span a temporal interval of more than 20 years, with a significant outlier dating back to the ’80s. Figure 3 graphically represents the temporal trend of adoption of the different approaches for the methods we considered.
4.1 Dictionary-based (DB)
This approach is based on the principle that time series share some common segments, without considering timestamps. These segments can be extracted into atoms, such that a time series segment can be represented with a sequence of these atoms. Atoms are then collected into a dictionary that associates each atom with a univocal key used both in the representation of time series and to search efficiently their content. The choice of the atom length should guarantee a low decompression error and maximize the compression factor at the same time. Algorithm 1 shows how the training phase works at a high level: the createDictionary function computes a dictionary of segments given a dataset composed of time series and a threshold value th. The find function searches the dictionary for a segment that is similar to the segment s, with a distance lower than the threshold th; a possible distance index is the \(MSE\). If a match is found, the algorithm merges the segment s with the matched one to achieve generalization. A larger th value results in higher compression, lower reconstruction accuracy, and a smaller dictionary.
After the segment dictionary is created, the compression phase takes in input one time series, and each segment is replaced with its key in the dictionary. If some segments are not present in the dictionary, they are left uncompressed, or a new entry is added to the dictionary. Algorithm 2 shows how the compression phase works: the compress function takes in input a time series to compress, the dictionary created during the training phase, and a threshold value, and returns the compressed time series as a list of indices and segments.
The compression achieved by this technique can be either lossy or lossless, depending on the implementation.
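To make the two phases concrete, here is a minimal Python sketch of Algorithms 1 and 2 under our own simplifying assumptions: fixed-length segments, \(MSE\) as the distance, and a running-average merge for generalization.

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def create_dictionary(dataset, th, seg_len):
    """Training phase (Algorithm 1): collect and merge similar segments."""
    dictionary = []
    for ts in dataset:
        for i in range(0, len(ts) - seg_len + 1, seg_len):
            s = np.asarray(ts[i:i + seg_len], dtype=float)
            match = min(dictionary, key=lambda a: mse(a, s), default=None)
            if match is not None and mse(match, s) < th:
                match += 0.5 * (s - match)   # merge for generalization
            else:
                dictionary.append(s)         # new atom
    return dictionary

def compress(ts, dictionary, th, seg_len):
    """Compression phase (Algorithm 2): keys for matches, raw otherwise."""
    out = []
    for i in range(0, len(ts) - seg_len + 1, seg_len):
        s = np.asarray(ts[i:i + seg_len], dtype=float)
        j = min(range(len(dictionary)),
                key=lambda k: mse(dictionary[k], s), default=None)
        if j is not None and mse(dictionary[j], s) < th:
            out.append(("key", j))           # dictionary index
        else:
            out.append(("raw", s))           # left uncompressed
    return out
```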
The main challenges for this architecture are to
– maximize the search speed when looking up time series segments in the dictionary;
– make the time series segments stored in the dictionary as general as possible, to minimize the distance in the compression phase.
4.1.1 TRISTAN.
One implementation of the DB architecture is TRISTAN [7], an algorithm divided into two phases: the learning phase and the compression phase.
Learning phase. The dictionary used in this implementation can be created by domain experts that add typical patterns, or it can be learned from a training set. To learn a dictionary from a training set \(T = [t_1, \dots , t_n]\) of \(n\) segments, the following minimization problem has to be solved:
\(\min _{D,\, w_1, \dots , w_n}\ \sum _{i = 1}^{n} \Vert w_i D - t_i \Vert _2^2 \quad \text{s.t. } \Vert w_i \Vert _0 \le sp, \qquad (5)\)
where \(D\) is the obtained dictionary, \(sp\) is a fixed parameter representing sparsity, \(w_i\) is the compressed representation of segment \(i\), and \(w_i D\) is the reconstructed segment. The meaning of this formulation is that the solution to be found is a dictionary that minimizes the distance between original and reconstructed segments.
The problem shown in Equation (5) is NP-hard [7], thus an approximate result is computed. A technique for approximating this result is shown in [8].
Compression phase. Once the dictionary is built, the compression phase consists in finding \(w\) such that
\(w = \operatorname{arg\,min}_{w} \Vert s - wD \Vert _2, \qquad (6)\)
where \(D\) is the dictionary, \(s\) is a segment, \(w \in \lbrace 0,1\rbrace ^k\), and \(k\) is the length of the compressed representation. The element of \(w\) in position \(i\) is 1 if \(a_i \in D\) is used to reconstruct the original segment, and 0 otherwise.
Finding a solution for Equation (6) is an NP-hard problem and, for this reason, the matching pursuit method [9] is used to approximate the original problem:
\(w = \operatorname{arg\,min}_{w} \Vert s - wD \Vert _2 \quad \text{s.t. } \Vert w \Vert _0 \le sp, \qquad (7)\)
where \(sp\) is a fixed parameter representing sparsity.
Reconstruction phase. Having a dictionary \(D\) and a compressed representation \(w\) of a segment \(s\), it is possible to compute the reconstruction \(\overline{s}\) as
\(\overline{s} = wD.\)
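The following sketch illustrates the compression and reconstruction steps with a greedy, matching-pursuit-style selection of atoms; the binary-weight greedy loop is our own simplification of Equation (7), not the authors' code.

```python
import numpy as np

def compress_segment(s, D, sp):
    """Greedily pick at most sp atoms of D (one per row) to approximate s."""
    s = np.asarray(s, dtype=float)
    w = np.zeros(D.shape[0])
    residual = s.copy()
    for _ in range(sp):
        i = int(np.argmax(D @ residual))   # atom most aligned with residual
        if w[i]:
            break                          # atom already selected: stop
        w[i] = 1.0
        residual = s - w @ D               # update the residual
    return w

def reconstruct_segment(w, D):
    return w @ D                           # s_bar = wD
```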
4.1.2 CORAD.
This implementation extends the idea presented in TRISTAN [10]. The main difference is that it adds autocorrelation information to achieve better performance in terms of compression ratio and accuracy.
Correlation between two time series \(TS^A\) and \(TS^B\) is measured with the Pearson correlation coefficient:
\(r = \frac{\sum _{i = 1}^{n}(x_i - \overline{x})(y_i - \overline{y})}{\sqrt {\sum _{i = 1}^{n}(x_i - \overline{x})^2}\,\sqrt {\sum _{i = 1}^{n}(y_i - \overline{y})^2}},\)
where \(x_i\) is an element of \(TS^A\), \(y_i\) is an element of \(TS^B\), and \(\overline{x},\overline{y}\) are the mean values of the corresponding time series. This coefficient can be applied also to segments that have different ranges of values, and \(r \in [-1, 1]\), where 1 expresses the maximum linear correlation, \(-1\) the maximum linear negative correlation, and 0 no linear correlation.
Time series are divided into segments and time windows are set. For each window, the correlation is computed between each pair of segments belonging to it, and the results are stored in a correlation matrix \(M\in \mathbb {R}^{n \times n}\), where \(n\) is the number of segments in the window.
Compression phase. During this phase, segments are sorted from the least correlated to the most correlated, using the correlation matrix \(M\). The metric used to measure how much one segment is correlated with all the others is the absolute sum of its correlations, computed as the sum of the corresponding row of \(M\). Knowing the correlation information, a dictionary is used only to represent segments that are not correlated with others, as in the TRISTAN implementation, while the remaining segments are represented solely using correlation information.
Reconstruction phase. The reconstruction phase starts with the segments represented with dictionary atoms, while the others are reconstructed looking at the segments to which they are correlated. This process is very similar to the one proposed in TRISTAN, with the sole difference that segments represented in the dictionary are managed differently from those that are not.
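A small sketch of the segment-ordering step, assuming the score is the sum of absolute Pearson coefficients (our reading of "absolute sum"):

```python
import numpy as np

def order_by_correlation(segments):
    """Sort the segments of a window from least to most correlated."""
    M = np.corrcoef(segments)          # pairwise Pearson coefficients
    score = np.abs(M).sum(axis=1)      # how correlated each segment is overall
    return np.argsort(score), M        # least correlated first
```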
4.1.3 Accelerometer LZSS (A-LZSS).
A-LZSS is an algorithm built on top of the LZSS algorithm [11] for searching matches [12]. The A-LZSS algorithm uses Huffman codes, generated offline using frequency distributions. In particular, this technique considers blocks of size \(s=1\) to compute frequencies and build the code: an element of the time series will be replaced by a variable number of bits. Moreover, larger blocks can be considered and, in general, having larger blocks gives better compression performance at the cost of larger Huffman code tables.
The implementation of this technique is shown in Algorithm 3, where:
– minM: the minimum match length, which is asserted to be \(\texttt {minM} \gt 0\);
– Ln: determines the lookahead distance, as \(2^{Ln}\);
– Dn: determines the dictionary atoms length, as \(2^{Dn}\);
– longestMatch: a function that returns the index \(I\) of the found match and the length L of the match. If the length of the match is too small, then the Huffman code representation of s is sent as the compressed representation; otherwise, the index and the length of the match are sent, and the next L elements are skipped.
This implementation uses a brute-force approach, with complexity \(O(2^{Dn}\cdot 2^{Ln})\), but it is possible to improve it by using hashing techniques.
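A minimal sketch of the brute-force longestMatch search (illustrative, not the authors' implementation):

```python
def longest_match(window, lookahead):
    """Return (index, length) of the longest prefix of `lookahead`
    occurring in `window` (the already-seen dictionary portion)."""
    best_i, best_len = 0, 0
    for i in range(len(window)):
        length = 0
        while (length < len(lookahead) and i + length < len(window)
               and window[i + length] == lookahead[length]):
            length += 1
        if length > best_len:
            best_i, best_len = i, length
    return best_i, best_len
```

The encoder emits (index, length) when the match is at least minM long; otherwise it falls back to the Huffman code of the next symbol.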
4.1.4 Differential LZW (D-LZW).
The core of this technique is the creation of a very large dictionary that grows over time: once the dictionary is created, if a buffer block is found inside the dictionary, it is replaced by the corresponding index; otherwise, the new block is inserted in the dictionary as a new entry [13].
Adding new blocks guarantees lossless compression, but has the drawback of producing dictionaries that are too large. This makes the technique suitable only for particular scenarios (i.e., input streams composed of words/characters, or streams in which the values inside a block are quantized).
Another drawback of this technique is how the dictionary is constructed: elements are simply appended to the dictionary to preserve the indexing of previous blocks. For a simple implementation of the dictionary, the complexity for each search is \(O(n)\) where \(n\) is the size of the dictionary. This complexity can be improved by using more efficient data structures.
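The sketch below illustrates the idea under our own simplifying assumptions: integer (quantized) values, delta encoding first, and an append-only dictionary of delta blocks; a matching decoder must mirror the same insertion rule.

```python
def d_lzw_compress(values):
    """D-LZW sketch: delta encoding, then an LZW-style growing dictionary."""
    deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
    dictionary = {}               # tuple of deltas -> index, append-only
    out, block = [], ()
    for d in deltas:
        candidate = block + (d,)
        if candidate in dictionary:
            block = candidate                       # extend the matched block
            continue
        dictionary[candidate] = len(dictionary)     # append the new block
        if block:
            out.append(("idx", dictionary[block]))  # emit index of known prefix
            if (d,) in dictionary:
                block = (d,)
            else:
                out.append(("raw", d))              # unseen symbol: emit raw
                dictionary[(d,)] = len(dictionary)
                block = ()
        else:
            out.append(("raw", d))                  # unseen single delta
    if block:
        out.append(("idx", dictionary[block]))
    return out
```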
4.2 Function Approximation (FA)
The main idea behind function approximation is that a time series can be represented as a function of time. Since finding a function that approximates the whole time series is infeasible due to the presence of new values that cannot be handled, the time series is divided into segments, and an approximating function is found for each of them.
Exploring all the possible functions \(f: T \rightarrow X\) is not feasible; thus, implementations consider only one family of functions and try to find the parameters that best approximate each segment. This makes the compression lossy.
A strength of this approach is that it does not depend on the data domain: no training phase is required, since the regression algorithm considers only single segments in isolation.
4.2.1 Piecewise Polynomial Approximation (PPA).
This technique divides a time series into several segments of fixed or variable length and tries to find the best polynomials that approximate each segment. Although the compression is lossy, a maximum deviation from the original data can be fixed a priori to enforce a given reconstruction accuracy.
The implementation of this algorithm is described in [14], where the authors apply a greedy approach and three different online regression algorithms for approximating constant functions, straight lines, and polynomials. These online algorithms are
– the PMR-Midrange algorithm, which approximates using constant functions [15];
– the optimal approximation algorithm, described in [16], which uses linear regression;
– the randomized algorithm presented in [17], which approximates using polynomials.
The algorithm used in [14] for approximating a time series segment is explained in Algorithm 4.
This algorithm repeatedly finds the polynomial of degree between 0 and a fixed maximum that can approximate the longest segment within the threshold error, yielding the maximum local compression factor. After a prefix of the stream has been selected and compressed into a polynomial, the algorithm analyzes the following stream segment. A higher value of \(\epsilon\) returns higher compression and lower reconstruction accuracy. The fixed maximum polynomial degree \(\rho\) affects compression speed, accuracy, and compression ratio: higher values slow down compression and reduce the compression factor, but return higher reconstruction accuracy.
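The following offline sketch conveys the greedy idea (a didactic simplification of ours, not the three online algorithms of [14]): extend the current piece while some polynomial of bounded degree stays within \(\epsilon\).

```python
import numpy as np

def ppa_compress(x, eps, max_degree):
    """Greedy piecewise fit: one (start, length, coefficients) per piece."""
    pieces, start = [], 0
    while start < len(x):
        best_len, best_coef = 1, np.array([x[start]])   # fallback: one point
        for degree in range(max_degree + 1):
            length = best_len
            while start + length < len(x):
                t = np.arange(length + 1)
                seg = np.asarray(x[start:start + length + 1], dtype=float)
                coef = np.polyfit(t, seg, min(degree, length))
                if np.max(np.abs(np.polyval(coef, t) - seg)) > eps:
                    break                               # cannot extend further
                length += 1
                best_len, best_coef = length, coef
        pieces.append((start, best_len, best_coef))
        start += best_len
    return pieces
```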
4.2.2 Chebyshev Polynomial Transform (CPT).
Another implementation of polynomial compression can be found in [18]. In this article, the authors show how a time series can be compressed into a sequence of finite Chebyshev polynomials. The principle is very similar to the one shown in Section 4.2.1, but based on the use of a different type of polynomial. Chebyshev polynomials are of two types, \(T_n(x)\) and \(U_n(x)\), defined as [19]:
\(T_n(\cos \theta) = \cos (n\theta), \qquad U_n(\cos \theta)\,\sin \theta = \sin ((n+1)\theta),\)
where \(n \ge 0\) is the polynomial degree.
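In practice, a segment can be fitted with a truncated Chebyshev series; the sketch below uses NumPy's Chebyshev class as an illustrative stand-in for the method of [18].

```python
import numpy as np

def chebyshev_compress(x, degree):
    """Keep degree+1 Chebyshev coefficients as the compressed segment."""
    t = np.arange(len(x))
    return np.polynomial.Chebyshev.fit(t, x, degree).coef

def chebyshev_decompress(coef, n):
    # rebuild on the same domain [0, n-1] that .fit used for the time axis
    cheb = np.polynomial.Chebyshev(coef, domain=[0, n - 1])
    return cheb(np.arange(n))
```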
4.2.3 Discrete Wavelet Transform (DWT).
DWT uses wavelet functions to transform time series. Wavelets are functions that, similarly to a wave, start from zero and end with zero after some oscillations. An application of this technique can be found in [20].
This transformation can be written as
\(c_{m,n} = \sum _{t} x(t)\, a^{-m/2}\, \psi (a^{-m} t - n b),\)
where \(a \gt 1\), \(b \gt 0\), and \(m,n\in \mathbb {Z}\).
To recover the transformed signal, the following formula can be applied:
\(x(t) = \sum _{m}\sum _{n} c_{m,n}\, a^{-m/2}\, \psi (a^{-m} t - n b).\)
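As a concrete and deliberately minimal instance, the sketch below implements one level of the Haar wavelet, the simplest DWT; the wavelet choice is ours, not necessarily the one used in [20]. Compression follows by discarding detail coefficients close to zero.

```python
import numpy as np

def haar_dwt(x):
    """One Haar level: approximation and detail halves of an even-length x."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)
    return approx, detail

def haar_idwt(approx, detail):
    """Exact inverse of haar_dwt."""
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2)
    x[1::2] = (approx - detail) / np.sqrt(2)
    return x
```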
4.2.4 Discrete Fourier Transform (DFT).
Together with the DWT, the DFT is commonly used for signal compression. Given a time series \([(t_1, x_1), \dots , (t_n, x_n)]\), it is defined as
\(X_k = \sum _{j = 1}^{n} x_j\, e^{-\frac{i2\pi }{n}(j-1)(k-1)}, \quad k = 1, \dots , n,\)
where \(e^{\frac{i2\pi }{n}}\) is a primitive \(n\)th root of 1.
To efficiently compute this transformation, the Fast Fourier Transform algorithm can be used, such as the one introduced in [42]. As described in [43], a time series can be compressed after applying the DFT by cutting the coefficients that are close to zero. This can be done also by fixing a compression ratio and discarding all the coefficients after a certain index.
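A small sketch of this coefficient-cutting strategy with NumPy's FFT; keeping a fixed number of largest-magnitude coefficients is our illustrative variant (assume keep >= 1):

```python
import numpy as np

def dft_compress(x, keep):
    """Zero out all but the `keep` largest-magnitude DFT coefficients."""
    coef = np.fft.rfft(x)                    # real-input FFT
    small = np.argsort(np.abs(coef))[:-keep]
    coef[small] = 0                          # cut coefficients close to zero
    return coef

def dft_decompress(coef, n):
    return np.fft.irfft(coef, n)             # inverse transform
```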
4.2.5 Discrete Cosine Transform (DCT).
Given a time series \([(t_1, x_1), \dots , (t_n, x_n)]\), the DCT is defined as
\(X_k = \sum _{j = 1}^{n} x_j \cos \left[ \frac{\pi }{n} \left(j - \frac{1}{2}\right)(k-1) \right], \quad k = 1, \dots , n.\)
The computation of this transformation is computationally expensive, so efficient implementations can be used, such as the one presented in [44].
Similarly to the DFT, compression of time series can be achieved by cutting the coefficients that are close to zero or by fixing a compression ratio and discarding all the coefficients after a certain index. An example of this technique can be found in [45].
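The same strategy for the DCT, sketched with SciPy's implementation (the truncation index is illustrative):

```python
import numpy as np
from scipy.fft import dct, idct

def dct_compress(x, keep):
    """Keep only the first `keep` DCT coefficients."""
    return dct(np.asarray(x, dtype=float), norm="ortho")[:keep]

def dct_decompress(coef, n):
    full = np.zeros(n)
    full[:len(coef)] = coef       # discarded coefficients are zero
    return idct(full, norm="ortho")
```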
4.3 Autoencoders
An autoencoder is a particular neural network that is trained to give as output the same values passed as input. Its architecture is composed of two symmetric parts: encoder and decoder. Given an input of dimension \(n\), the encoder gives as output a vector with dimensionality \(m \lt n\), called code, while the decoder reconstructs the original input from the code, as shown in Figure 4 [21].
4.3.1 Recurrent Neural Network Autoencoder (RNNA).
RNNA compression algorithms exploit recurrent neural networks [22] to achieve a compressed representation of a time series. Figure 5 shows the general unrolled structure of a recurrent neural network encoder and decoder. The encoder takes in input time series elements, which are combined with hidden states. Each hidden state is computed starting from the new input and the previous state. The last hidden state of the encoder is passed as the first hidden state of the decoder, which applies the same mechanism, with the only difference being that each hidden state provides an output. The output provided by each state is the reconstruction of the corresponding time series element and is passed to the next state.
The new hidden state \(h_t\) is obtained by applying
\(h_t = \phi (W x_t + U h_{t-1}),\)
where \(\phi\) is a logistic sigmoid function or the hyperbolic tangent, and \(W\) and \(U\) are learned weight matrices.
One application of this technique is shown in [23], in which also Long Short-Term Memory [24] networks are considered. This implementation compresses time series segments of different lengths using autoencoders. The compression achieved is lossy, and a maximal loss threshold \(\epsilon\) can be enforced.
The training set is preprocessed by computing the temporal variation of the data over a local time window \(\mathcal {L}\). The variation value obtained is then used to partition the time series, such that each segment has a total variation close to a predetermined value \(\tau\).
Algorithm 5 shows an implementation of the RNN autoencoder approach, with an error threshold \(\epsilon\), where:
– RAE: the recurrent autoencoder trained on a training set, composed of an encoder and a decoder;
– getError: computes the reconstruction error between the original and reconstructed segment;
– \(\epsilon\): the error threshold value.
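A sketch of this outer loop follows; the trained autoencoder is abstracted as a pair of callables, and all names are placeholders of ours rather than the authors' API.

```python
def rnna_compress(segments, rae_encode, rae_decode, get_error, eps):
    """Keep the code if the reconstruction error is within eps, else raw."""
    out = []
    for s in segments:
        code = rae_encode(s)                     # compressed representation
        if get_error(s, rae_decode(code)) <= eps:
            out.append(("code", code))
        else:
            out.append(("raw", s))               # enforce the error threshold
    return out
```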
4.3.2 Recurrent Convolutional Autoencoder (RCA).
A variation of the technique presented in the previous subsection is proposed in [49] and consists of adding convolutional layers. Convolutional layers are used mainly on image data to extract local features. In the case of time series, these layers allow extracting local fluctuations and addressing complex inputs.
4.3.3 DZip.
DZip is a lossless recurrent autoencoder [47]. This model also uses a prediction technique: it tries to predict the next symbol, having, in this case, a fixed vocabulary. To achieve lossless compression, it combines two modules: the bootstrap model and the supporter model, as shown in Figure 6.
The bootstrap model is composed of two stacked bidirectional gated recurrent units, followed by linear and dense layers. The output of this module is a probability distribution, that is, the probability of each symbol in the fixed vocabulary being the next one.
The supporter model is used to better estimate the probability distribution with respect to the bootstrap model. It is composed of three stacked neural networks, which act as independent predictors of varying complexity.
The two models are then combined by applying the following equation:
\(o = \lambda\, o_b + (1 - \lambda)\, o_s,\)
where \(o_b\) is the output of the bootstrap model, \(o_s\) is the output of the supporter model, and \(\lambda \in [0, 1]\) is a learnable parameter.
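The combination step itself is a simple convex mixture of the two probability vectors, e.g.:

```python
import numpy as np

def combine(o_b, o_s, lam):
    """Mix bootstrap and supporter outputs; lam in [0, 1] is learnable."""
    return lam * np.asarray(o_b) + (1 - lam) * np.asarray(o_s)
```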
4.4 Sequential Algorithms
This architecture is characterized by the sequential combination of several simple compression techniques. Some of the most used techniques are
– Huffman coding;
– delta encoding;
– run-length encoding;
– Fibonacci binary encoding.
These techniques, summarized below, are the building blocks of the methods presented in the following subsections.
Huffman coding. Huffman coding is the basis of many compression techniques, being one of the necessary steps, for example, of the algorithm shown in Section 4.4.4.
The encoder creates a dictionary that associates each symbol with a binary representation and replaces each symbol of the original data with the corresponding representation. The compression algorithm is shown in Algorithm 6 [6].
The createPriorityList function creates a list of elements ordered from the least frequent to the most frequent; the addTree function adds to the tree a father node with its children; createDictionary assigns 0 and 1, respectively, to the left and right arcs, and creates the dictionary by assigning to each character the sequence of 0s and 1s found on the route from the root to the leaf that represents that character. Since the priority list is sorted according to frequency, more frequent characters are inserted closer to the root and are represented using shorter codes.
The decoder algorithm is very simple since the decoding process is the inverse of the encoding process: the encoded bits are searched in the dictionary and replaced with the corresponding original symbols.
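A compact Python sketch of Algorithm 6 (a standard heap-based construction, written in our own style):

```python
import heapq
from collections import Counter

def huffman_dictionary(symbols):
    """Build the symbol -> bit-string dictionary used by the encoder."""
    # priority list ordered by frequency; leaves wrap the raw symbols
    heap = [(f, i, ("leaf", sym))
            for i, (sym, f) in enumerate(Counter(symbols).items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)       # two least frequent nodes
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, ("node", left, right)))
        count += 1                              # father node added to the tree
    code = {}
    def walk(node, prefix):                     # createDictionary step
        if node[0] == "node":
            walk(node[1], prefix + "0")         # left arc: 0
            walk(node[2], prefix + "1")         # right arc: 1
        else:
            code[node[1]] = prefix or "0"       # single-symbol edge case
    walk(heap[0][2], "")
    return code

def huffman_encode(symbols, code):
    return "".join(code[s] for s in symbols)
```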
Delta encoding. This technique encodes a target file with respect to one or more reference files [25]. In the particular case of time series, each element at time \(t\) is encoded as \(\Delta (x_t, x_{t-1})\).
Run-length. In this technique, each run (a sequence in which the same value is repeated consecutively) is substituted with the pair \((v_t, o)\), where \(v_t\) is the value at time \(t\) and \(o\) is the number of consecutive occurrences [26].
Fibonacci binary encoding. This encoding technique is based on the Fibonacci sequence, defined as
\(F(N) = F(N-1) + F(N-2), \qquad F(1) = 1,\ F(2) = 2.\)
The Fibonacci binary coding of a positive integer \(N\) is the bit string \(a_1 a_2 \dots a_p 1\) such that
\(N = \sum _{i = 1}^{p} a_i\, F(i),\)
where \(a_i \in \lbrace 0,1\rbrace\) is the \(i\)th bit of the representation [27]; the trailing 1 makes the pair "11" mark the end of every codeword.
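A short sketch of the encoding (a greedy Zeckendorf decomposition, with the extra trailing '1' described above):

```python
def fibonacci_encode(n):
    """Fibonacci binary code of a positive integer n."""
    fibs = [1, 2]                          # F(1), F(2), ...
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    bits = []
    for f in reversed(fibs[:-1]):          # greedy: largest Fibonacci first
        if f <= n:
            bits.append("1")
            n -= f
        else:
            bits.append("0")
    return "".join(reversed(bits)) + "1"   # a_1..a_p plus the terminator
```

For example, fibonacci_encode(4) returns "1011" (4 = 1 + 3).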
4.4.1 Delta Encoding, Run-length, and Huffman (DRH).
This technique combines three well-known compression techniques [28]: delta encoding, run-length encoding, and Huffman coding. Since these techniques are all lossless, the compression provided by DRH is lossless too, if no quantization is applied.
Algorithm 7 describes the compression algorithm, where \(Q\) is the quantization level and is asserted to be \(\ge\) 1: if \(Q = 1\) the compression is lossless, while increasing values of \(Q\) return a higher compression factor and lower reconstruction accuracy.
The decompression algorithm is the inverse of the compression one: once data is received, it is decoded using the Huffman code and reconstructed using the repetition counter.
Since this kind of algorithm is not computationally expensive, the compression phase can be performed also by low-resource computational units, such as sensor nodes.
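A sketch of the pipeline up to the entropy-coding step (the Huffman stage can reuse the dictionary sketched in Section 4.4; the rounding-based quantization is our own reading of \(Q\)):

```python
def drh_preprocess(values, q=1):
    """Quantize with level q, delta encode, then run-length encode."""
    quantized = [round(v / q) for v in values]  # lossless for integers iff q == 1
    deltas = [quantized[0]] + [b - a for a, b in zip(quantized, quantized[1:])]
    runs = []                                   # run-length pairs (value, count)
    for d in deltas:
        if runs and runs[-1][0] == d:
            runs[-1] = (d, runs[-1][1] + 1)     # extend the current run
        else:
            runs.append((d, 1))                 # start a new run
    return runs                                 # these pairs are Huffman-encoded
```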
4.4.2 Sprintz.
The Sprintz algorithm [29] is designed for the IoT scenario, in which energy consumption and speed are important factors. In particular, the goal is to satisfy the following requirements:
– handling of small block sizes;
– high decompression speed;
– lossless data reconstruction.
The proposed algorithm is a coder that exploits prediction to achieve better results. In particular, it is based on the following components:
– Forecasting: used to predict the difference between new samples and the previous ones, through delta encoding or the FIRE algorithm [29];
– Bit packing: packages are composed of a payload that contains prediction errors and a header that contains information used during reconstruction;
– Run-length encoding: if a sequence of correct forecasts occurs, the algorithm does not send anything until some error is detected, and the length of the skipped zero-error packages is added as information;
– Entropy coding: package headers and payloads are coded using Huffman coding, presented in Section 4.4.
4.4.3 Run-Length Binary Encoding (RLBE).
This lossless technique is composed of five steps, combining delta encoding, run-length encoding, and Fibonacci coding, as shown in Figure 7 [30]. This technique is developed specifically for devices characterized by low memory and computational resources, such as IoT infrastructures.
4.4.4 RAKE.
The RAKE algorithm, presented in [31], exploits sparsity to achieve compression. It is a lossless compression algorithm with two phases: preprocessing and compression.
Preprocessing. In this phase, a dictionary is used to transform the original data. For this purpose, many algorithms can be used, such as the Huffman coding presented in Section 4.4; but since the aim is that of obtaining sparsity, the RAKE dictionary uses a code similar to unary coding, in which every codeword has at most one bit set to 1. This dictionary does not depend on symbol probabilities, so no learning phase is needed. Table 1 shows a simple RAKE dictionary.
Compression. This phase works, as suggested by the algorithm name, as a rake of \(n\) teeth. Figure 8 shows an execution of the compression phase, given a preprocessed input and \(n=4\).
The rake starts at position 0. If there is no bit set to 1 in the rake interval, a 0 is added to the code; otherwise, a 1 is added, followed by the binary representation of the relative index of the first bit set to 1 in the rake (2 bits in the example in Figure 8). The rake is then shifted right by \(n\) positions in the first case, or to the position right after the first found bit in the second. In the figure, the rake is initially at position 0, and the first bit set to 1 is in relative position 1 (output: 1 followed by 01), so the rake advances by 2 positions (to just after the first 1 in the rake); all bits are then zero (output: 0), so the rake is moved forward by 4 places; next, the first bit set to 1 in the rake has relative index 2 (output: 1 followed by 10), so the rake is advanced by 3 places, and the process continues for the two last rake positions (output: 101 and 0, respectively).
Decompression. The decompression processes the compressed bit stream by replacing each 0 with \(n\) occurrences of 0, and each 1 followed by an offset with a number of 0s equal to the offset, followed by a bit set to 1. The resulting stream is decoded on the fly using the dictionary.
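The compression walk above can be condensed into a few lines; this sketch of ours mirrors the example of Figure 8 (the final bit-level packing is left out):

```python
def rake_compress(bits, n):
    """Slide a rake of n teeth over the sparse bit stream."""
    width = (n - 1).bit_length()          # bits per relative index
    out, pos = [], 0
    while pos < len(bits):
        window = bits[pos:pos + n]
        if 1 not in window:
            out.append("0")               # no bit set: advance by n
            pos += n
        else:
            i = window.index(1)           # first bit set to 1 in the rake
            out.append("1" + format(i, f"0{width}b"))
            pos += i + 1                  # restart right after that bit
    return "".join(out)
```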
4.5 Others
In this subsection, we present other time series compression algorithms that cannot be grouped in the previously described categories.
4.5.1 Major Extrema Extractor (MEE).
This algorithm is introduced in [32] and exploits time series features (maxima and minima) to achieve compression. For this purpose, strict, left, right, and flat extrema are defined. Considering a time series \(TS = [(t_0, x_0), \dots , (t_n,x_n)]\), \(x_i\) is a minimum if it satisfies one of these conditions:
– strict: \(x_i \lt x_{i-1} \wedge x_i \lt x_{i+1}\);
– left: \(x_i \lt x_{i-1} \wedge \exists k \gt i : \forall j \in [i, k], x_j = x_i \vee x_i \lt x_{k+1}\);
– right: \(x_i \lt x_{i+1} \wedge \exists k \lt i : \forall j \in [k, i], x_j = x_i \vee x_i \lt x_{k-1}\);
– flat: \((\exists k \gt i : \forall j \in [i, k], x_j = x_i \vee x_i \lt x_{k+1}) \wedge (\exists k \lt i : \forall j \in [k, i], x_j = x_i \vee x_i \lt x_{k-1})\).
Maximum points are defined similarly.
After defining minimum and maximum extrema, the authors introduce the concept of importance, based on a distance function dist and a compression ratio \(\rho\): \(x_i\) is an important minimum if there exist \(i_l \lt i \lt i_r\) such that \(x_i\) is a minimum among \(x_{i_l}, \dots , x_{i_r}\), with \(dist(x_{i_l}, x_i) \ge \rho\) and \(dist(x_i, x_{i_r}) \ge \rho\). Important maximum points are defined similarly.
Once important extrema are found, they are used as a compressed representation of the segment. This is a lossy compression technique, like the SM one described in the next section. Although the compressed data can be used to obtain properties of the original data that are useful for visual representation (minima and maxima), it is impossible to reconstruct the original data.
4.5.2 Segment Merging (SM).
This technique, presented in [48] and reused in [33] and [50], considers time series with regular timestamps and repeatedly replaces sequences of consecutive elements (segments) with a summary consisting of a single value and a representation error, as shown in Figure 9, where the error is omitted.
After compression, segments are represented by tuples \((t, y, \delta)\), where \(t\) is the starting time of the segment, \(y\) is the constant value associated with the segment, and \(\delta\) is the segment error. The merging operation can be applied either to a set of elements or to a set of segments, to further compress a previously compressed time series. The case of elements is an instance of the case of segments, and it is immediate to generalize from two to more elements/segments. We limit our description to the merge of two consecutive segments represented by the tuples \((t_i, y_i, \delta _i)\) and \((t_j, y_j, \delta _j)\), where \(i \lt j\), into a new segment represented by the tuple \((t, y, \delta)\) computed as
\(t = t_i, \qquad y = \frac{\Delta t_i\, y_i + \Delta t_j\, y_j}{\Delta t_i + \Delta t_j}, \qquad \delta = \sqrt {\frac{\Delta t_i\,(\delta _i^2 + (y_i - y)^2) + \Delta t_j\,(\delta _j^2 + (y_j - y)^2)}{\Delta t_i + \Delta t_j}},\)
where \(\Delta t_x\) is the duration of the segment \(x\); thus \(\Delta t_i = t_j - t_i\), \(\Delta t_j = t_k - t_j\), and \(t_k\) is the timestamp of the segment after the one starting at \(t_j\).
The sets of consecutive segments to be merged are chosen to minimize segment error with constraints on a maximal acceptable error and maximal segment duration.
This compression technique is lossy and the result of the compression phase can be considered both as the compressed representation and as the reconstruction of the original time series, without any additional computations executed by a decoder.
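A sketch of the two-segment merge, following the duration-weighted equations above (the pooled form of \(\delta\) reflects our reading of the representation error):

```python
from math import sqrt

def merge_segments(seg_i, seg_j, t_k):
    """Merge two consecutive (t, y, delta) segments; t_k closes the second."""
    (t_i, y_i, d_i), (t_j, y_j, d_j) = seg_i, seg_j
    dt_i, dt_j = t_j - t_i, t_k - t_j                 # segment durations
    y = (dt_i * y_i + dt_j * y_j) / (dt_i + dt_j)     # duration-weighted value
    delta = sqrt((dt_i * (d_i ** 2 + (y_i - y) ** 2)
                  + dt_j * (d_j ** 2 + (y_j - y) ** 2)) / (dt_i + dt_j))
    return (t_i, y, delta)
```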
4.5.3 Continuous Hidden Markov Chain (CHMC).
The idea behind this algorithm is that the data generation process follows a probabilistic model and can be described with a Markov chain [34]. This means that a system can be represented with a set of finite states \(S\) and a set of arcs \(A\) for transition probabilities between states. Once the hidden Markov chain is found using known techniques, such as the one presented in [35], a lossy reconstruction of the original data can be obtained by following the chain probabilities.
4.6 Algorithms Summary
When we apply the taxonomy described in Sections 3.2 and 4 to the above techniques, we obtain the classification reported in Table 2, which summarizes the properties of the different implementations. To help the reader in visually grasping the membership of the techniques to the different parts of the taxonomy, we report the same classification graphically in Figure 1 using Venn diagrams. In the table, \(\checkmark\) indicates that the related property holds, — that it does not, and NA that it is not applicable. Min \(\rho\) and max \(\epsilon\) indicate, respectively, the possibility to set a minimum compression ratio or a maximum reconstruction error.
6 Conclusions
The amount of time series data produced in several contexts requires specialized time series compression algorithms to store them efficiently. In this article, we analyze the most relevant ones, propose a taxonomy to classify them, and report synoptically the accuracy and compression ratio published by their authors for different datasets. Even though this is a meta-analysis, it informs the reader about the suitability of a technique for specific contexts, for example, when a fixed compression ratio or accuracy is required, or for a dataset having particular characteristics.
The selection of the most appropriate compression technique highly depends on the time series characteristics, the constraints required for the compression, and the intended purpose.
Starting from the first case, time series can be characterized by their regularity (that is, having or not having fixed time intervals between two consecutive measurements), sparsity, fluctuations, and periodicity. Considering regularity, only techniques based on function approximation can be applied directly to the time series without the need for transformations. This is because that approach approximates a function passing through the given points, with time on the x-axis, so the time distance between consecutive points can be arbitrary. All the other techniques assume the time distance between two points to be fixed; to apply them, some transformations of the original time series are needed. Some techniques work better on input time series with high sparsity, such as the ones proposed in Section 4.4. Another characteristic to consider is fluctuations. The lossy compression techniques presented in this survey cannot achieve good accuracy on input time series with high non-regular fluctuations: FA and autoencoder techniques tend to flatten peaks, while the ones based on dictionaries would need overly large dictionaries. On the other hand, FA and lossy autoencoders can be taken into consideration if it is sufficient to reconstruct the general trend of the original time series. Lastly, DB techniques work well with time series with high periodicity: in this scenario, a relatively small dictionary can be enough to represent the whole time series by combining atoms. Lossy autoencoders also improve their reconstruction accuracy on periodic input time series.
The constraints required for compression include compression ratio, reconstruction accuracy, speed, and availability of a training dataset. As shown in Table 2, some techniques allow specifying a minimum compression ratio and a maximum reconstruction error. This can be necessary in contexts in which space or accuracy are given as constraints. Moreover, compression speed can be relevant, in particular when hardware with low computational capacity is involved. In this context, SA can compress at high speed, without needing powerful devices. Finally, if no training set is available, all the techniques that need a training phase must be discarded.
The different purposes for time series compression include analysis, classification, and visualization. For time series analysis, FA and lossy autoencoders can help to reduce noise and outliers in time series that have many fluctuations. Autoencoders can also be used for deeper analyses, such as anomaly detection, starting from the compressed representation, as shown in [46]. The CHMC technique can also be used for this aim, since the time series can be represented with a probability graph. Autoencoders can be used for classification too, since an autoencoder can reduce the dimensionality of the input. Lastly, for visualization purposes, MEE and SM can be used to render long time series at different levels of detail.