
Scalable time series forecasting and anomaly detection

Sourav Khemka
Data Science at Microsoft
7 min read · Mar 15, 2022

Time series forecasting and anomaly detection are important for many businesses in today’s data-intensive world. They find use across diverse industries including IT, manufacturing, retail, health care, banking, and finance, with applications in sales forecasting, inventory analysis, intrusion detection, fraud detection, and production system monitoring, among others.

In the Microsoft Cloud Data Sciences (MCDS) team where I work, we have multiple use cases for time series forecasting and anomaly detection in the spaces of Azure finance and commerce, involving a very large number of time series. Formally, we can define the problem in this way:

Given N univariate time series representing daily data, forecast and detect whether there is an anomaly on a day T.

Forecasting and detecting anomalies in a univariate time series is a well-known problem and many solutions exist that are quite effective. But as we increase the number of time series, these solutions often do not scale well. For our use case, N can reach 100,000, prompting us to create a solution that can scale to the volume we require.

Solution

Our solution is best thought of as having the following parts: The basic idea, the forecasting method, scalability constraints, efficient hyperparameter optimization, and anomaly detection.

Basic idea

The simplest approach to detecting anomalies in a univariate time series is to forecast day T from previous time steps using methods like ARIMA or Prophet, and then use the difference between the forecast and actual values to check whether an anomaly exists.

Forecasting method

Prophet is a widely used technique for time series forecasting. Since its inception, Prophet has gained popularity due to its ease of implementation and its ability to fit a variety of time series with minor tweaks. Prophet is an additive regression model, where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. We use Prophet as our forecasting method.

The Prophet model equation: y(t) = g(t) + s(t) + h(t) + e(t), where g(t) is the trend, s(t) the periodic seasonality, h(t) the holiday effects, and e(t) the error term. Source: How does Prophet work? by Deepti Goyal on Medium.com.
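
To make the additive structure concrete, here is a toy decomposition in the spirit of that equation (an illustration only, not Prophet's actual fitting code): a series built from a linear trend g(t) plus a weekly seasonal term s(t).

```python
import math

def additive_model(t, trend_slope=0.5, weekly_amplitude=3.0):
    """Toy additive series: a linear trend g(t) plus a weekly seasonal
    term s(t), mirroring Prophet's y(t) = g(t) + s(t) + e(t) structure."""
    g = trend_slope * t                                    # trend component
    s = weekly_amplitude * math.sin(2 * math.pi * t / 7)   # weekly seasonality
    return g + s

# The components add independently, which is what makes the model "additive".
series = [additive_model(t) for t in range(14)]
```

Because the components are independent, Prophet can fit each of them separately and sum them to produce the forecast.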

Prophet hyperparameters

  • changepoint_prior_scale controls how flexibly the trend changes are fit. If its value is very low the model will fail to capture trend changes, and if it is very high the model will chase every trend change and overfit.
  • seasonality_prior_scale controls the flexibility of the seasonality. A large value allows the seasonality to fit large fluctuations, and a small value shrinks its magnitude.
  • seasonality_mode has two options: additive and multiplicative.
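
When tuning, these hyperparameters are typically searched over small grids or distributions. The value ranges below are illustrative assumptions, not tuned recommendations:

```python
from itertools import product

# Candidate values for Prophet's key hyperparameters. These grids are
# illustrative assumptions; in practice they would feed a tuner such as Optuna.
search_space = {
    # Trend flexibility: too low misses trend changes, too high overfits them.
    "changepoint_prior_scale": [0.001, 0.01, 0.1, 0.5],
    # Seasonality flexibility: large values allow bigger seasonal swings.
    "seasonality_prior_scale": [0.01, 0.1, 1.0, 10.0],
    # Whether seasonal effects add to or scale with the trend.
    "seasonality_mode": ["additive", "multiplicative"],
}

# Full grid: 4 * 4 * 2 = 32 candidate configurations.
candidates = [dict(zip(search_space, values))
              for values in product(*search_space.values())]
```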

Scalability constraints

In general, Prophet is very fast to fit a time series and get accurate forecasts using the default hyperparameters. As mentioned earlier, for our use case N is approximately 100,000, so we also leverage Python’s parallel processing to generate the forecasts as quickly as possible.

Generally, Prophet performs quite well with the default hyperparameters, but we can further improve performance by searching for optimal hyperparameters with a Bayesian optimization framework like Optuna or Hyperopt. But even a mere 20 Optuna trials per series across 100,000 time series on a Standard D64s v3 (64 vCPUs, 256 GB memory) Azure VM takes more than 300 hours, or 12.5 days, which is not only inefficient but also impractical for our model because it must run daily.

Efficient hyperparameter optimization

Due to the constraints mentioned above, we leverage the intuitive parameters of the Prophet model to limit the hyperparameter optimization to only a few hundred time series and use them to derive the hyperparameters of the remaining time series.

For example, let’s say we have a universal set U consisting of all our time series, and a small subset A of U that is randomly sampled. We then take the following steps:

  1. Get optimal hyperparameters of small subset A (for example, some 500 time series) using Optuna.
  2. For each time series in the universal set U, we find the most similar time series in the subset A. (We discuss the similarity metric to use below.)
  3. The hyperparameter to be used for each time series in the universal set U is the optimal hyperparameter of its nearest time series in subset A, for which the hyperparameters were calculated in step 1.
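
The three steps above can be sketched as follows, with plain Euclidean distance standing in for the DTW similarity discussed below, and hypothetical pre-tuned parameters standing in for the Optuna output of step 1:

```python
def distance(a, b):
    """Placeholder for the DTW similarity used in the article
    (plain Euclidean distance on equal-length series, for brevity)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def transfer_hyperparameters(universe, subset, tuned_params):
    """Steps 2 and 3: assign each series in `universe` the hyperparameters
    of its nearest neighbor in `subset`. `tuned_params[i]` holds the
    tuned parameters for subset[i], computed in step 1."""
    assigned = []
    for series in universe:
        nearest = min(range(len(subset)),
                      key=lambda i: distance(series, subset[i]))
        assigned.append(tuned_params[nearest])
    return assigned

# Two tuned "anchor" series with hypothetical tuned hyperparameters.
subset = [[1, 1, 1], [10, 10, 10]]
tuned = [{"changepoint_prior_scale": 0.01}, {"changepoint_prior_scale": 0.5}]
universe = [[1, 2, 1], [9, 11, 10]]
params = transfer_hyperparameters(universe, subset, tuned)
```

Only the small subset is ever tuned; the expensive search is amortized across the full universe through the nearest-neighbor lookup.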

With this approach, we were able to reduce the time necessary to forecast 100,000 time series to just four or five hours.

Similarity metric: DTW

We use Dynamic Time Warping (DTW) to calculate similarity among time series. DTW is a dynamic programming algorithm for measuring the similarity of two temporal sequences that may vary in speed. It is a very robust technique for comparing two or more time series by ignoring any shifts in speed.

Euclidean Matching and Dynamic Time Warping Matching. Source: Dynamic Time Warping on paperswithcode.com.

Intuition

Now let’s take a moment to understand intuitively why we can substitute the hyperparameter of a given time series with a time series of high similarity. If we look at the hyperparameters of Prophet that we are trying to tune, they are representative of the characteristics of the time series, and that is essentially what DTW captures. In other words, two time series that are very similar according to the DTW metric should have similar hyperparameters.

Because we are dealing with such a large number of time series, we can expect that different time series will have different scales, and so we normalize the time series while calculating the DTW distance to ensure that only the temporal structure is used to make the comparison.
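
Putting the two ideas together, here is a minimal sketch of z-normalized DTW using the textbook dynamic-programming recurrence (a production system would use an optimized library such as tslearn or fastdtw):

```python
def znorm(series):
    """Scale a series to zero mean and unit variance so that only
    temporal shape, not magnitude, drives the comparison."""
    n = len(series)
    mean = sum(series) / n
    std = (sum((x - mean) ** 2 for x in series) / n) ** 0.5 or 1.0
    return [(x - mean) / std for x in series]

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic-programming DTW distance."""
    a, b = znorm(a), znorm(b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of prefixes a[:i] and b[:j]
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[-1][-1]
```

Because of the normalization, two series with the same shape at different scales come out nearly identical, and because of the warping, a series compared against a stretched copy of itself also scores as a close match.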

Anomaly detection

Along with the forecasts, the Prophet model provides a lower and upper bound on them: the uncertainty interval. This is the credible interval (the Bayesian analog of a confidence interval) of the posterior predictive distribution that Prophet estimates. We can configure the width of this interval using the interval_width parameter (default = 0.80); the higher the interval_width, the wider the uncertainty interval. Because our use case is anomaly detection, we choose an interval_width of 0.99 to minimize false alarms.

If the actual value is outside the uncertainty interval, i.e., lower than the lower bound or higher than the upper bound, we say it’s an anomaly.
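
This rule is a simple bounds check. The sketch below assumes the lower and upper bounds have already been extracted from Prophet's forecast output (its yhat_lower and yhat_upper columns):

```python
def is_anomaly(actual, yhat_lower, yhat_upper):
    """Flag day T as anomalous when the observed value falls
    outside the model's uncertainty interval."""
    return actual < yhat_lower or actual > yhat_upper

# With interval_width=0.99 the bounds are wide, so only large
# deviations from the forecast are flagged, minimizing false alarms.
assert is_anomaly(actual=120.0, yhat_lower=80.0, yhat_upper=110.0)
assert not is_anomaly(actual=95.0, yhat_lower=80.0, yhat_upper=110.0)
```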

Forecast evaluation

To measure the error in the forecast we use Symmetric Median Absolute Percentage Error (SMAPE).

SMAPE is defined as the median over days t of 2 * |F_t - A_t| / (|A_t| + |F_t|), expressed as a percentage, where A_t is the actual value and F_t the forecast. Source: How to find symmetric mean absolute error in python? on stackoverflow.com.

Because we are dealing with multiple time series at once, we use the following metric to evaluate our model: We calculate the SMAPE for each time series for a forecast horizon of one day, group the SMAPE values into 10 buckets at 10 percent intervals, and try to maximize the number of time series with SMAPE below 10 percent.
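
This evaluation can be sketched as follows; the median-based formula matches the metric's name above, and for a one-day horizon each series contributes a single error value:

```python
def smape(actuals, forecasts):
    """Symmetric Median Absolute Percentage Error, in percent."""
    errors = sorted(
        200.0 * abs(f - a) / (abs(a) + abs(f))
        for a, f in zip(actuals, forecasts)
    )
    mid = len(errors) // 2
    if len(errors) % 2:
        return errors[mid]
    return (errors[mid - 1] + errors[mid]) / 2

def bucket_counts(smape_values, width=10, buckets=10):
    """Group per-series SMAPE values into `buckets` bins of `width` percent."""
    counts = [0] * buckets
    for v in smape_values:
        counts[min(int(v // width), buckets - 1)] += 1
    return counts

# The goal is to maximize the first bucket (SMAPE below 10 percent).
per_series = [3.0, 7.5, 12.0, 45.0, 95.0]
counts = bucket_counts(per_series)
```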

We compared our results on the two sets of time series that used default hyperparameters versus the efficient hyperparameter optimization technique discussed above.

Forecast set 1 (73,000 time series)

In this set we fitted 73,000 univariate time series, each with at least 40 training data points, and evaluated the model. The percentage of time series with SMAPE below 10 percent is 77 percent with hyperparameters tuned by our approach, compared to 71 percent with the defaults.

Forecast set 2 (20,000 time series)

In this set we fitted 20,000 univariate time series, each with at least 40 training data points, and evaluated the model. The percentage of time series with SMAPE below 10 percent is 81 percent with hyperparameters tuned by our approach, compared to 78 percent with the defaults.

Because we don’t have any historical labels for anomalies, we don’t yet have a way to evaluate the anomaly detection system. We plan, however, to manually tag flagged anomalies as true or false each day on real data points, and then use those labels to evaluate the system.

Conclusion

Our goal was to detect anomalies accurately across a large number of time series. We used Prophet as the forecasting technique and compared forecast and actual values to flag anomalies. Tuning Prophet’s hyperparameters did not scale to our volume of time series, so our solution finds optimal hyperparameters for a very small set of time series and uses them to derive nearly optimal hyperparameters for the rest. With this approach we were able to forecast 100,000 time series within four to five hours on a Standard D64s v3 (64 vCPUs, 256 GB memory) Azure VM, a job that previously took more than 300 hours on the same machine. Accuracy was also better than with the default hyperparameters, demonstrating the effectiveness of our solution.

Thanks for reading this article. I would appreciate reading your feedback and comments if you care to leave them using the Comments function below.

Sourav Khemka is on LinkedIn.
