
Time Series Anomaly Detection

Models and System Architectures: ARMA, LSTM-VAE, SR-CNN

Vadim Smolyakov
Towards Data Science
7 min read · Apr 22, 2021


Introduction

In the real world, we often care about monitoring key signals to make sure they follow expected patterns of behavior, and when something unexpected occurs we’d like to be able to explain it. Signals of this nature are known as time series, and there are several methods to both predict and detect anomalies in them. Examples range from key performance indicators such as sales volumes, to personal finance budgets, to stock market closing prices. We are going to look at three models, starting with the classic ARMA model and following up with two modern deep-learning approaches: LSTM-VAE and SR-CNN. In addition to the models themselves, we are going to touch on system architecture and look at the interplay between systems and models.

Time Series Signals. Image by Canva. Free Images License.

ARMA

As a warm-up, let’s start with a classic model for time-series data: the Auto-Regressive Moving-Average (ARMA) model. Let X_t be our signal X at time t. We assume that X_t depends linearly on the previous values X_{t-1},…,X_{t-p}, where p is the order of auto-regression. Thus AR(p) is defined as:

X_t = c + \phi_1 X_{t-1} + … + \phi_p X_{t-p} + \epsilon_t

where \phi_1,…,\phi_p are auto-regression parameters that can be learned from historical data and used to predict or find similar time series, c is a constant, and \epsilon_t ~ N(0, \sigma²) is Gaussian noise.

The Moving Average (MA) model of order q regresses on the past noise terms \epsilon_{t-1},…,\epsilon_{t-q}. Thus MA(q) is defined as:

X_t = \mu + \epsilon_t + \theta_1 \epsilon_{t-1} + … + \theta_q \epsilon_{t-q}

where \theta_1,…,\theta_q are learnable parameters of the model, \mu is the expectation of X_t, and the \epsilon_{t-i} ~ N(0, \sigma²) are iid Gaussian noise terms.

We can combine the two complementary views of modeling X_t into a single equation, giving rise to the ARMA(p, q) model:

X_t = c + \epsilon_t + \phi_1 X_{t-1} + … + \phi_p X_{t-p} + \theta_1 \epsilon_{t-1} + … + \theta_q \epsilon_{t-q}

Using historical data, we can select p and q for the ARMA(p, q) model and learn the coefficients {\theta} and {\phi}, based on which we can make future predictions. A substantial deviation from the prediction signals a time-series anomaly; we can define substantial as, say, two standard deviations from a moving average of X. The model parameters {\theta} and {\phi} are learned via Maximum Likelihood Estimation (MLE) as implemented in the statsmodels package [1]. Given its simplicity, this model may be sufficient for many applications.
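Here is a minimal sketch of this approach using statsmodels [1]. The synthetic signal, the order (2, 2), and the two-sigma residual threshold are illustrative choices, not values prescribed above:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Synthetic example: a noisy sine wave with one injected spike anomaly.
rng = np.random.default_rng(42)
t = np.linspace(0, 20 * np.pi, 1000)
x = np.sin(t) + 0.1 * rng.standard_normal(t.size)
x[700] += 3.0  # the anomaly we want to detect

# ARMA(p, q) is ARIMA(p, 0, q); the coefficients {phi}, {theta} are fit via MLE.
result = ARIMA(x, order=(2, 0, 2)).fit()

# Score each point by its deviation from the model's one-step prediction
# and flag residuals larger than two standard deviations.
resid = result.resid
anomalies = np.flatnonzero(np.abs(resid) > 2 * resid.std())
print("anomalous indices:", anomalies)
```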

LSTM-VAE

Let’s take a look at a more sophisticated model: the LSTM-VAE, a Variational Auto-Encoder (VAE) with LSTM encoder and decoder modules for representing time-series data [2]. The architecture of the model is shown in Figure 1.

Figure 1: Anomaly Detection LSTM-VAE Model Architecture. Image by Vadim Smolyakov.

The input consists of n signals x_1,…,x_n, and the output is the log probability of observing input x_i under the normal (non-anomalous) training parameters {\mu_i, \sigma_i}. This means the model is trained on non-anomalous data in an unsupervised fashion; when an anomaly does occur on a given input x_i, the corresponding log-likelihood log p(x_i | \mu_i, \sigma_i) drops, and we can threshold that drop to signal an anomaly.

We assume a Gaussian likelihood, so every sensor has two degrees of freedom {\mu, \sigma} to represent its behavior. As a result, for n input sensors we learn 2n output parameters (means and variances) that are able to differentiate anomalous from normal behavior.

While the input signals are independent, the VAE embeds them in a joint latent space at the sampling layer. The embedding is structured as a Gaussian that approximates the standard normal N(0, 1) by minimizing KL divergence.

The model is trained in an unsupervised fashion with an objective function that achieves two goals: 1) it maximizes the log-likelihood output of the model averaged over sensors, and 2) it structures the embedding space to approximate N(0, 1):

L = (1/n) \sum_i log p(x_i | \mu_i, \sigma_i) − KL( q(z | x) ‖ N(0, I) )
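To make the objective concrete, here is a compact PyTorch sketch of an LSTM-VAE in this spirit. It is not the exact architecture of Figure 1: the hidden and latent sizes, and the choice to repeat z at every decoder time step, are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class LSTMVAE(nn.Module):
    """Sketch of an LSTM-VAE: encode a window of n sensor signals into a
    latent z, decode back to per-sensor Gaussian parameters (mu, log_var)."""

    def __init__(self, n_sensors: int, hidden: int = 64, latent: int = 16):
        super().__init__()
        self.encoder = nn.LSTM(n_sensors, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent)
        self.to_logvar = nn.Linear(hidden, latent)
        self.decoder = nn.LSTM(latent, hidden, batch_first=True)
        self.out_mu = nn.Linear(hidden, n_sensors)
        self.out_logvar = nn.Linear(hidden, n_sensors)

    def forward(self, x):                        # x: (batch, time, n_sensors)
        _, (h, _) = self.encoder(x)              # final encoder hidden state
        z_mu, z_logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = z_mu + torch.exp(0.5 * z_logvar) * torch.randn_like(z_mu)  # sampling layer
        z_seq = z.unsqueeze(1).repeat(1, x.size(1), 1)  # repeat z at every time step
        dec, _ = self.decoder(z_seq)
        return self.out_mu(dec), self.out_logvar(dec), z_mu, z_logvar

def elbo_loss(x, mu, logvar, z_mu, z_logvar):
    # 1) Negative Gaussian log-likelihood of x under (mu, sigma), up to a constant.
    nll = 0.5 * (logvar + (x - mu) ** 2 / logvar.exp()).mean()
    # 2) KL divergence pulling the latent posterior toward N(0, I).
    kl = -0.5 * torch.mean(1 + z_logvar - z_mu ** 2 - z_logvar.exp())
    return nll + kl

model = LSTMVAE(n_sensors=3)
x = torch.randn(8, 50, 3)             # 8 windows, 50 time steps, 3 sensors
mu, logvar, z_mu, z_logvar = model(x)
loss = elbo_loss(x, mu, logvar, z_mu, z_logvar)
```

At inference time, the per-sensor log-likelihood log p(x_i | \mu_i, \sigma_i) is the quantity we threshold to flag anomalies.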

Let’s look at how we can implement this model in a scalable and modular way. We will Dockerize our anomaly detection model and create a number of supporting micro-services (data ingestion, training, inference, and notification), all implemented as Docker containers and orchestrated by Kubernetes. We will use a Kafka message bus for ingesting sensor input data and publishing notifications, along with a NoSQL database for any required storage.

Figure 3: Microservices Software Architecture. Image by Vadim Smolyakov.

Figure 3 shows the micro-service, multi-container software architecture. Connected as inputs are three sensors: a video camera that counts the number of people present in a room, an audio sensor that outputs sound amplitude in dB, and a Raspberry Pi device used as a diverse signal generator. The output is a notification receiver that reports time-series anomalies.
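The inference micro-service in such an architecture can be as simple as a loop over a Kafka topic. Below is a hedged sketch using the kafka-python client; the topic names, broker address, threshold, and the log_likelihood stand-in are all hypothetical placeholders for the components in Figure 3:

```python
import json
from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

def log_likelihood(window):
    # Hypothetical stand-in: the deployed inference service would run the
    # trained LSTM-VAE here and return mean log p(x | mu, sigma) for the window.
    return 0.0

THRESHOLD = -10.0  # illustrative value, tuned on held-out normal data

# Topic names and broker address are assumptions; adapt to your deployment.
consumer = KafkaConsumer("sensor-readings",
                         bootstrap_servers="kafka:9092",
                         value_deserializer=lambda v: json.loads(v.decode()))
producer = KafkaProducer(bootstrap_servers="kafka:9092",
                         value_serializer=lambda v: json.dumps(v).encode())

for message in consumer:
    reading = message.value              # e.g. {"sensor": "audio", "window": [...]}
    score = log_likelihood(reading["window"])
    if score < THRESHOLD:                # log-likelihood drop => anomaly
        producer.send("anomaly-alerts",
                      {"sensor": reading["sensor"], "log_likelihood": score})
```

Because each service only talks to the message bus, the model container can be retrained, swapped, or scaled out without touching ingestion or notification.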

As a simple example, consider a square-wave input with a spike indicating an anomaly. As shown in Figure 4, we can detect a drop in the log-likelihood and threshold it to signal an anomaly.

Figure 4: A spike anomaly is detected as shown by the drop in the log likelihood. Image by Vadim Smolyakov.

In summary, state-of-the-art deep learning combined with a micro-service, multi-container software architecture yields a scalable and modular solution for real-time anomaly detection.

SR-CNN

The Spectral Residual (SR) CNN [3] takes a computer-vision view of the problem of anomaly detection. SR-CNN is a novel algorithm that borrows the SR model from the visual saliency detection domain and applies it to time-series anomaly detection [3]. Figure 5 shows the deep learning architecture. To quote the authors:

Figure 5: Anomaly Detection SR-CNN Model Architecture [3]. Image by the authors of the paper [3].

“The Spectral Residual (SR) algorithm consists of three major steps: 1) Fourier Transform to get the log amplitude spectrum, 2) calculation of spectral residual [wrt average spectrum] and 3) Inverse Fourier Transform that transforms the time-series back to spatial domain…The spectral residual serves as a compressed representation of the sequence while the innovation part of the original sequence becomes more significant.”

The resulting transformation is called the saliency map and is shown in Figure 6.

Figure 6: Spectral Residual (SR) Saliency Map [3]. Image by the authors of the paper [3].
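The SR transform is simple enough to sketch in a few lines of numpy, following the three steps in the quote above. The mean-filter width q and the test signal are illustrative choices, not values from the paper:

```python
import numpy as np

def spectral_residual_saliency(x, q=3):
    """Saliency map of a 1-D series via the Spectral Residual transform [3]."""
    eps = 1e-8
    spec = np.fft.fft(x)                       # 1) Fourier Transform
    amplitude = np.abs(spec) + eps
    phase = np.angle(spec)
    log_amp = np.log(amplitude)
    # 2) Spectral residual: log spectrum minus its local average.
    avg_log_amp = np.convolve(log_amp, np.ones(q) / q, mode="same")
    residual = log_amp - avg_log_amp
    # 3) Inverse transform back to the time domain gives the saliency map.
    return np.abs(np.fft.ifft(np.exp(residual + 1j * phase)))

# A spike "innovation" stands out sharply in the saliency map.
x = np.sin(np.linspace(0, 8 * np.pi, 400))
x[200] += 2.0
s = spectral_residual_saliency(x)
print("most salient index:", int(np.argmax(s)))
```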

The saliency map is fed into two 1-D convolutional layers followed by two fully connected layers, with filter size equal to the sliding window size. The model is trained to minimize a cross-entropy loss using stochastic gradient descent; it is trained in a supervised way and requires anomaly labels. The model achieves high F1 scores across several benchmarks (see [3] for details).
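Below is a loose PyTorch sketch of such a discriminator. It simplifies the paper's design in two ways, both assumptions for readability: the kernel sizes and channel counts are small rather than equal to the window size, and it classifies a whole saliency-map window rather than individual points:

```python
import torch
import torch.nn as nn

class SRCNNSketch(nn.Module):
    """Sketch of an SR-CNN-style discriminator: two 1-D convolutions over a
    saliency-map window followed by two fully connected layers."""

    def __init__(self, window: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=7, padding=3), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * window, 128), nn.ReLU(),
            nn.Linear(128, 1),            # anomaly logit for the window
        )

    def forward(self, x):                 # x: (batch, 1, window)
        return self.fc(self.conv(x))

model = SRCNNSketch(window=64)
saliency = torch.randn(8, 1, 64)          # batch of saliency-map windows
labels = torch.zeros(8, 1)                # supervised: 1 = anomaly, 0 = normal
loss = nn.BCEWithLogitsLoss()(model(saliency), labels)
```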

The overall system is implemented in three major components: data ingestion, online compute, and an experimentation platform. Time series are ingested into InfluxDB and Kafka by the ingestion worker, with throughput varying from 10K to 100K data points per second.

Figure 7: SR-CNN System Architecture [3]. Image by the authors of the paper [3].

In the online compute module, the anomaly detection processor calculates the anomaly status for each incoming time-series signal online, while the alert processor sends out notifications if an anomaly occurs. Finally, in the experimentation platform, model performance is evaluated before deployment. In addition, users are provided a service to label anomaly regions in the time-series data. The experimentation platform is built on the Azure machine learning service.

In summary, the SR-CNN model provides a computer-vision take on the problem of time-series anomaly detection. It is highly scalable and capable of accurately detecting anomalies across millions of signals in production.

Conclusion

Humans are naturally drawn to patterns. We like to create mental models of the patterns we find in time-series data, while being constrained by time to look only at the most important signals. AI helps scale anomaly detection to millions of signals by employing scalable software architectures and state-of-the-art deep learning models. We started with the basic ARMA model and extended our understanding to a probabilistic LSTM-VAE model with a latent representation of time-series data. We also looked at anomaly detection from a computer vision point of view via a clever combination of saliency maps and convolutional neural networks. In both cases, whether we are searching for predictability in patterns or trying to discover a new law of physics, AI can help with anomaly detection.

References

[1] S. Seabold and J. Perktold, “statsmodels: econometric and statistical modeling with python”: https://www.statsmodels.org/

[2] Park et al., “A Multimodal Anomaly Detector for Robot-Assisted Feeding Using an LSTM-Based Variational Autoencoder”, arXiv 2018: https://arxiv.org/pdf/1711.00614.pdf

[3] Hansheng Ren et al., “Time-Series Anomaly Detection Service at Microsoft”, KDD 2019: https://arxiv.org/abs/1906.03821
