
Time series forecasting (Part 3 of 3): Introducing AUTS (Adaptive Univariate Time Series forecasting algorithm)

Yasmin Bokobza
Data Science at Microsoft
17 min read · Apr 27, 2021

By Yasmin Bokobza and Siddharth Kumar

This is our third article in a series in which we focus on time series forecasting methods and applications. In the first article, we discussed how to choose the right forecasting method and how to make the forecasting task simpler. We also provided a list of some popular Python packages for performing forecasting and time series analysis. In the second article, we presented our approach to algorithm selection by walking through the capabilities of the Univariate Forecast Engine that we developed. In addition, we presented how we enabled stakeholders, who are not necessarily data scientists, to move quickly on their time series problems and make high-quality forecasts. In this article, we discuss approaches to time series forecasting with an emphasis on what led us to develop the Adaptive Univariate Time Series (AUTS) algorithm for the forecasting tasks we have encountered. We also delve into details of the AUTS methodology, including how we deployed the model using Microsoft Azure, and we provide guidance for tackling your own business problems. Finally, we share the results and validation for our business forecasting scenario within Microsoft.

Related work

There are several widely used approaches to time series forecasting. The most common are exponential smoothing [2, 3] and ARIMA models. While exponential smoothing models are based on a description of trend and seasonality in the data, ARIMA models aim to describe autocorrelations in the data [1]. A drawback of ARIMA models is that the process of including seasonal covariates and automatically selecting the best one leads to extremely long fitting times [4]. In addition, we have observed that in our own use cases, automatic ARIMA [5] and exponential smoothing forecasts are prone to large trend errors when there is a change in trend near the cutoff period, and they do not do well with multiple seasonality.

Long Short-Term Memory (LSTM) networks [6] have also been widely used for time series forecasting since their effectiveness in large-scale forecasting was first demonstrated. In some of our own business forecast use cases, however, we have had to build forecasting models on the fly and predict in real time. As a result, we did not proceed with this approach because it does not meet our latency constraints.

Facebook Prophet is a well-known forecasting model that is based on a decomposable time series model [7] with three main model components: trend, seasonality, and holidays. The Prophet model is fast and designed to have intuitive parameters that can be adjusted easily. In addition, the model can easily accommodate seasonality with multiple periods [4]. We have found, however, that Facebook Prophet has not done well with shifts in trend or with outliers in the business forecast tasks that we have encountered.

Our work with these various approaches has helped us create a specialized solution for our own forecast scenarios. In the next section, we review and discuss details of the forecast model methodologies that we have applied to our business problems.

The AUTS algorithm

As mentioned in Part 2, one of the evaluated models in the engine is AUTS. This is a generic machine learning–based time series forecasting algorithm built on top of a univariate forecast model. In this section we describe the methodology of our univariate forecast model, discuss the high-level architecture of AUTS, and introduce our workflow for scalable model deployment using Microsoft Azure.

Univariate forecast model methodology

Our univariate forecast model deals well with shifts in trend, outliers, and multiple seasonalities such as weekly, monthly, and so on. In addition, because it is a continuous-time model, missing data are not an issue. Unlike in ARIMA models, which are defined in terms of lagged variables, the measurements in our model do not need to be regularly spaced, and we do not need to interpolate missing values. The univariate forecast model methodology is depicted in Figure 1.

Figure 1: Univariate forecast model methodology.

First, we feed the model with historical data, such as customer usage or purchase history. Then, the model applies a series of automatic pre-processing steps. While there are several widely used time series methods for some of these steps, we use methods that are relatively fast, simple, and efficient, because the model must be able to apply the pre-processing steps under latency constraints. The processing steps can be divided into four main actions:

1. Adjusting historical data: Such adjustments can often lead to a simpler forecasting task. An example is a calendar adjustment, in which variations caused by calendar effects are removed before fitting a forecasting model [1]. For example, data can vary from month to month simply because months have different numbers of days, in addition to the seasonal variation across the year. By looking at average daily values instead of monthly totals, we effectively remove the variation due to different month lengths. Simpler patterns are usually easier to model and lead to more accurate forecasts.
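For illustration, a calendar adjustment of this kind takes only a couple of lines in pandas (the monthly series below is hypothetical):

```python
import pandas as pd

# Hypothetical monthly totals indexed by month-end dates.
monthly = pd.Series(
    [3100, 2800, 3150, 2980],
    index=pd.date_range("2021-01-31", periods=4, freq="M"),
)

# Divide each monthly total by the number of days in that month, so
# variation caused purely by month length is removed.
average_daily = monthly / monthly.index.days_in_month
```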

2. Mathematical transformations: If the data show variations that increase or decrease with the level of the series, then a transformation can be useful. An example of such a transformation, applied or skipped depending on the value of a Boolean parameter, is log(Aₜ), where A₁, …, Aₜ are the actual observations. Logarithmic transformations are often useful because logarithms are interpretable: Changes in a log value are relative (or percentage) changes on the original scale, and they constrain the forecasts to stay positive on the original scale [1].
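A minimal sketch of such a Boolean-controlled transformation and its inverse (illustrative only, not the AUTS code):

```python
import numpy as np

def transform(values, use_log: bool):
    # Optionally log-transform the (strictly positive) series; the
    # Boolean mirrors the tunable parameter described above.
    return np.log(values) if use_log else values

def inverse_transform(values, use_log: bool):
    # Map forecasts back to the original scale, where they are
    # guaranteed to be positive when the log transform was used.
    return np.exp(values) if use_log else values
```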

3. Removing outliers: The outlier detection problem for a time series is usually formulated as finding outlier data points relative to some standard or usual signal. Outliers, or discordant observations, introduce bias in the model parameter estimates and may widen the confidence intervals for the model parameters. Therefore, outliers can strongly influence predictions. The univariate forecast model removes outliers using Tukey’s test [12], which defines an outlier to be any observation outside the range:

[Q1 − k·IQR, Q3 + k·IQR]

Equation 1: Tukey’s test.

where Q1 and Q3 are the lower and upper quartiles, respectively, and IQR = Q3 − Q1. The k term is a constant; the greater the value of k, the more extreme an observation must be before it is removed. While there are several widely used time series outlier detection methods for removing global and local anomalies, we have found Tukey’s test to be the fastest, simplest, and most efficient for the business forecast tasks we have encountered. Figure 2 presents an example of outlier detection on actual data using Tukey’s test.

Figure 2: Outlier detection in real customer data using Tukey’s test.
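A minimal sketch of this outlier removal, directly implementing Equation 1 with NumPy:

```python
import numpy as np

def remove_outliers_tukey(y, k=1.5):
    """Keep only observations inside [Q1 - k*IQR, Q3 + k*IQR] (Equation 1).

    k=1.5 is the textbook default; larger k removes only more extreme points.
    """
    y = np.asarray(y, dtype=float)
    q1, q3 = np.percentile(y, [25, 75])
    iqr = q3 - q1
    return y[(y >= q1 - k * iqr) & (y <= q3 + k * iqr)]
```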

4. Extracting features on the tail of the time series: To identify whether there are relatively extreme trend changes in the time series, we extract features from the tail of the historical data (after outliers have been removed). The tail includes the most recent data points in the time series, which for our purposes (based on experiments) are the last 25 percent of points. If the tail features are significantly different from those of the prior 75 percent, we presume there is a change in trend. In such cases, the model makes the prediction using just the tail of the time series (which in some cases also reduces running time).
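Because the exact tail features and the significance test are not detailed here, the sketch below uses a simple mean comparison, measured in standard deviations, as an illustrative stand-in:

```python
import numpy as np

def trend_change_in_tail(y, tail_frac=0.25, threshold=2.0):
    """Flag a trend change when the mean of the most recent `tail_frac`
    of points differs from the mean of the earlier points by more than
    `threshold` standard deviations (both numbers are illustrative)."""
    y = np.asarray(y, dtype=float)
    split = int(len(y) * (1 - tail_frac))
    head, tail = y[:split], y[split:]
    scale = np.std(y) or 1.0  # guard against a constant series
    return abs(tail.mean() - head.mean()) / scale > threshold

# When a change is detected, the model would train on y[split:] only.
```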

After these processing steps, we model the time series features, which are trend, seasonality, and noise. The trend component models non-periodic changes in the values of the time series and includes two optional trend models: a saturating growth model and a piecewise linear model. These models detect trend changes by explicitly detecting possible places where the growth rate can change. To accommodate seasonality with multiple periods, we use a flexible model of periodic effects. The last component is the noise that represents random variation in the series. As is the case with many univariate time-series algorithms, the assumption is that the noise is normally distributed.

Finally, all model components are combined into a single model that is used for forecasting. The model can be multiplicative (the components multiplied together) or additive (the components added together), depending on the characteristics of the time series data. Besides producing forecasts, the univariate forecast model also estimates uncertainty intervals for the forecast. The uncertainty intervals provide an upper and lower expectation for the real observation. They are useful for assessing the range of possible actual outcomes for a prediction and provide a way to quantify and communicate the uncertainty in a prediction. The assumption is that there are two main sources of uncertainty in the forecast: uncertainty in the trend and additional observation noise. To estimate the uncertainty in the trend, we assume that the average frequency and magnitude of trend changes in the future will be the same as observed in the historical data. The trend model projects these trend changes forward and, by computing their distribution, produces the uncertainty intervals.
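Schematically, and assuming the components have already been fitted as arrays over the forecast horizon, the composition step looks like this (a sketch, not the AUTS implementation):

```python
import numpy as np

def compose(trend, seasonality, noise, mode="additive"):
    """Combine fitted trend, seasonality, and noise components into a
    single forecast series; `mode` mirrors the tunable composition
    parameter (additive or multiplicative)."""
    if mode == "additive":
        return trend + seasonality + noise
    return trend * seasonality * noise  # multiplicative composition
```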

AUTS architecture

AUTS is a generic machine learning–based time series forecasting algorithm implemented in Python. It is fast, can be tuned easily, and provides completely automated forecasts. Automatic forecasting is widely reviewed in the literature, and there are many methods adjusted for specific types of time series [8, 9]. Our approach is driven by both the nature of the business forecast tasks that we have encountered at Microsoft and the challenges involved in forecasting at scale. The high-level architecture of the AUTS algorithm can be divided into three stages, as depicted in Figure 3: ML-guided pre-processing, feature extraction, and prediction.

Figure 3: High-level architecture of the AUTS Algorithm.

The first stage aims to optimize the model parameters subject to the latency constraints of the forecast problem. Namely, if there are no latency constraints, for each training time series we fit the model with the best combination of parameter values. Otherwise, the parameters of the model are those selected in the ML-guided pre-processing stage. Examples of tunable parameters include:

- A parameter that specifies the number of time steps in the historical data used to train the model.
- Parameters of the anomaly removal method.
- Boolean parameters that control whether the mathematical transformations are performed.
- A parameter that sets the trend model type (saturating growth or piecewise linear).
- Parameters that increase or decrease trend flexibility and the strength of the seasonality component.
- A parameter that sets the components’ composition (multiplicative or additive).

During the ML-guided pre-processing step, for each training time series we fit a univariate forecast model with the best combination of parameter values, as described below. Using a rolling-window, out-of-time validation approach, we evaluate the performance of the univariate forecast models with different combinations of parameter values: One training record is generated for each parameter combination and each selected cutoff date in the time series, and the target variable is the SMAPE value of the resulting forecast. Care is taken to select the rolling windows in the historical training data so that there is no overlap with the rolling windows selected by the engine in the evaluation stage, ensuring that no data leakage occurs. Figure 4 presents an example of choosing the best combination of parameter values for a given time series. For simplicity we assume one time series, two possible combinations of parameters, one cutoff date selected for choosing the best parameter combination, and one cutoff date selected by the engine in the evaluation stage. The best combination for the given time series is the one with the lowest SMAPE value.

Figure 4: Example of choosing the best combination of parameter values for a given time series.
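To make the search concrete, here is a minimal sketch of rolling-window, out-of-time parameter selection; the parameter grid is illustrative and `fit_predict` is a hypothetical stand-in for the univariate forecast model:

```python
import itertools
import numpy as np

def smape(actual, forecast):
    # Symmetric mean absolute percentage error, in percent.
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return 100.0 * np.mean(np.abs(actual - forecast) /
                           ((np.abs(actual) + np.abs(forecast)) / 2.0))

# Hypothetical grid; the real tunable parameters are listed above.
param_grid = {"trend_model": ["linear", "saturating"], "use_log": [True, False]}
combos = [dict(zip(param_grid, values))
          for values in itertools.product(*param_grid.values())]

def best_params(series, cutoffs, horizon, fit_predict):
    avg_smape = []
    for params in combos:
        errors = []
        for cutoff in cutoffs:  # one training record per combo and cutoff
            train = series[:cutoff]
            actual = series[cutoff:cutoff + horizon]
            errors.append(smape(actual, fit_predict(train, horizon, **params)))
        avg_smape.append(np.mean(errors))
    return combos[int(np.argmin(avg_smape))]  # lowest average SMAPE wins
```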

In the ML-guided pre-processing stage, we also extract features from the training time series that may influence the tuning of the learning parameters. An example of such a feature is the magnitude of the time series slope, calculated using the Theil–Sen slope estimator proposed by Theil [10] and extended by Sen [11]. The Theil–Sen estimator of a set of two-dimensional points (xᵢ, yᵢ) is the median m of the slopes (yⱼ − yᵢ)/(xⱼ − xᵢ) computed over pairs of sample points having distinct x coordinates. This slope estimator is nonparametric, meaning that it does not assume any particular probability distribution. It can be computed efficiently and is insensitive to outliers. We have found that the magnitude of a time series’ slope is significant for tuning the learning parameters of the univariate forecast model.
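For illustration, the slope-magnitude feature can be computed with SciPy’s Theil–Sen implementation (the toy series is hypothetical):

```python
import numpy as np
from scipy.stats import theilslopes

y = np.array([10.0, 12.0, 11.5, 40.0, 14.0, 15.5, 16.0])  # one outlier at index 3
slope, intercept, low, high = theilslopes(y, np.arange(len(y)))
slope_magnitude = abs(slope)  # robust: the median slope ignores the outlier
```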

Based on the chosen parameter combination for each time series and the time series features, we build a decision tree. The decision tree output is used as feedback to update the learning parameters of the univariate forecast model and improve its performance. Namely, from the decision tree output we can derive rules across all the time series, such as a threshold on the magnitude of the slope above which a specific parameter combination is used. Figure 5 presents an example of possible input and output for the decision tree at the ML-guided pre-processing stage. For simplicity, we assume a single time series feature (the magnitude of the slope), two possible combinations of parameters, and four time series.

Figure 5: Example of possible input and output for the decision tree at the ML-guided pre-processing stage.
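A minimal sketch of this step, using hypothetical feature values and labels in the spirit of Figure 5, with scikit-learn’s decision tree as a stand-in for whatever implementation is actually used:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# One row per time series: its slope magnitude and the index of the
# best parameter combination found in ML-guided pre-processing.
slope_magnitude = np.array([[0.2], [0.3], [4.1], [5.7]])
best_combination = np.array([0, 0, 1, 1])

tree = DecisionTreeClassifier(max_depth=2).fit(slope_magnitude, best_combination)
# Readable rules, e.g., a slope threshold above which combination 1 is
# used; these become the parameter-selection rules at inference time.
print(export_text(tree, feature_names=["slope_magnitude"]))
```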

Next, if there are latency constraints, the feature extraction process is conducted. In this stage we extract features from the inference time series, the same features used in the ML-guided pre-processing step. Finally, during the prediction process, the univariate forecast model produces forecasting output for the inference time series. In this stage, if there are latency constraints, the model parameters are set by the rules derived from the decision tree in the ML-guided pre-processing stage, and the model produces forecasts using the inference time series features. Otherwise, if there are no latency constraints, the forecasts are produced by a model tuned to each individual time series. Thanks to multi-processing and ML-guided pre-processing (which can be performed in offline mode within a pre-scheduled period), AUTS deals well with latency constraints.

Model deployment

Model deployment is an essential part of any practical scenario. Figure 6 illustrates the workflow for AUTS model deployment in our forecasting applications. First, the relevant data is extracted from multiple SQL and/or Kusto databases. This data is stored in our blob storage using Microsoft Azure Data Factory (ADF), the Azure cloud ETL service for scale-out serverless data integration and data transformation.

Figure 6: Scalable model deployment using Microsoft Azure.

Next, the deployment of the final AUTS architecture is distributed using Microsoft Azure Machine Learning (AML), which is a cloud-based service for creating and managing machine learning solutions. AML enables distributed computing by breaking the problem into smaller tasks in a multi-core environment on multiple nodes in the cluster.

Next, the output is written back into our blob storage and then sent to the SQL and/or Kusto databases. Because most of our end users are advanced SQL users, they tend to want to have their own version of the output, which they can request for their convenience. Finally, the output can also integrate into dashboards or APIs for use by end users. These dashboards are usually dynamic — as soon as the data refreshes, they are updated accordingly. The model deployment pipeline can simply run at the desired frequency thanks to the use of ADF.

Business forecasting scenario: Results and validation

We use AUTS in a variety of business forecasting problems. The model output provides significant assistance to and insights for stakeholders by leveraging all our historical growth pattern data. As mentioned in Part 2, we repeatedly leverage the training and validation framework provided by the CGA Univariate Forecast Engine to evaluate the performance of different forecasting models. In this section we present a single scenario in which AUTS and other models have been deployed, and we gauge their performance and accuracy.

  1. Problem description: In this scenario our aim was to help teams across Microsoft be more efficient with their Azure usage by increasing the transparency of their Azure costs. To do that, we predicted future investments in subscriptions across Azure services and third-party Marketplace offerings based on historical data. This prediction is used by teams across Microsoft to better anticipate their future investments so they can optimize and build their budgets accordingly. The main challenges in this forecasting problem were dealing with outliers, changes in trend, and a lack of historical data.
  2. Dataset: The evaluation was performed with approximately 5,000 unique subscriptions representing high Azure investments, which makes them useful for monitoring model performance.
  3. Historical data granularity: Daily, weekly, and monthly.
  4. Forecast granularity: Daily, weekly, and monthly.
  5. Evaluation results: Table 1 presents the performance of the top three models at each granularity and at different forecast horizon sizes, averaged over all the subscriptions for five rolling windows (using different cutoff dates, as described in Part 2). The top three models are AUTS, Facebook Prophet [4], and Kusto [12]. The granularities and forecast horizons used to evaluate performance were defined by stakeholders. According to the results shown in Table 1, AUTS outperforms all other models under consideration.
Table 1: Performance of the models averaged over all the subscriptions.

To evaluate model accuracy, we also analyzed the SMAPE grouping. Figure 7 presents the SMAPE grouping of the top three models for forecasts in daily, weekly, and monthly granularities. The results indicate that AUTS is more accurate because, when we apply it, the number of forecasts with SMAPE between 0 and 10 percent is highest, and the number of forecasts with SMAPE between 90 and 100 percent is lowest.

Figure 7: SMAPE grouping of the three top models for forecasts in daily, weekly, and monthly granularities.
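For reference, a grouping like the one in Figure 7 can be produced by binning each forecast’s SMAPE value into ten-point buckets; the values below are hypothetical:

```python
import numpy as np
import pandas as pd

smape_values = np.array([3.2, 7.9, 12.4, 55.0, 94.1])  # one score per forecast
groups = pd.cut(pd.Series(smape_values), bins=range(0, 101, 10))  # (0, 10], ...
print(groups.value_counts().sort_index())
```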

To evaluate the performance of our models, it is important to analyze their stability as well, namely the spread of the SMAPE values. Figure 8 presents the SMAPE distribution of the three top models for forecasts at daily, weekly, and monthly granularities. Because the distribution is skewed, the standard deviation gives no information about its asymmetry. In such cases the first (Q1) and third (Q3) quartiles reveal more about the spread of the distribution. Q1 and Q3 are lowest for AUTS, indicating that its SMAPE values are concentrated around lower values. From these results we can conclude that the AUTS model is more balanced.

Figure 8: SMAPE distribution of the top three models for forecasts in daily, weekly, and monthly granularities.

To get a better understanding of the distribution of SMAPE values, we compared the distribution of the negative errors (over-forecasting) and positive errors (under-forecasting). Figure 9 presents, for the selected forecast granularities, the SMAPE distribution of the over-forecasting and under-forecasting cases for each of the three top models. From these results we can conclude that across all forecast granularities, AUTS tends to be more stable and accurate, because its SMAPE values are more concentrated around the median and there are fewer extreme cases (above the third quartile) of positive errors. Namely, looking at the distribution tails of the under-forecasting cases, the Facebook Prophet and Kusto models predict no costs for a higher number of cases in which costs actually occur (i.e., the spikes at the ends of the tails).

Figure 9: Daily, weekly, and monthly granularity.

As part of the evaluation, we also examined the success of the generated prediction intervals. For each of the forecasts we computed the actual coverage percentage of the prediction intervals generated by the AUTS and Facebook Prophet models (the top two models). Figure 10 presents the percentage of real observations within each forecast’s prediction intervals with a common confidence level of 80 percent (selected based on experiments).

The results indicate that for each data granularity, the average coverage percentage (the smooth line in the plots) of the AUTS prediction intervals is higher than that of Facebook Prophet. Namely, the AUTS model generates a higher number of forecast prediction intervals with coverage close to the desired confidence level. The differences in model performance can be attributed to the fact that AUTS is better than Facebook Prophet at handling shifts in the trend and other types of outliers that we encounter with business forecast tasks, which affects the accuracy of the models and consequently the actual coverage percentage of the prediction intervals.

In addition, for both models, the average coverage percentage at each data granularity is less than the 80 percent confidence level that was set. This finding is consistent with the narrow-intervals phenomenon reported in the Hyndman study [14]: Prediction intervals for forecasts are well known to be too narrow [13]. For example, Hyndman et al. [14] found that prediction intervals calculated for 95 percent coverage of the real observations may provide only between 71 and 87 percent coverage. The main reason for this well-known phenomenon is that not all sources of uncertainty are taken into account.

An interesting pattern emerging from the prediction interval success plots is that some forecasts have 0 percent of real observations within the prediction interval, which can be caused by unaccounted-for sources of uncertainty and is affected by the forecast horizon size. A common feature of prediction intervals is that they widen as the forecast horizon increases: The further out in time we forecast, the greater the uncertainty associated with the forecast, and thus the wider the prediction intervals and the greater the coverage percentage. For example, compare forecasts generated using monthly data and a seven-month horizon with forecasts created using daily data and a one-month horizon. For both models, the number of forecasts whose coverage percentage is zero is lower when the forecasts are at the monthly granularity.

Figure 10: Prediction intervals success.
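As a sketch of the metric behind Figure 10, the coverage percentage of a single forecast can be computed as the share of actual observations that fall inside the prediction interval:

```python
import numpy as np

def coverage_pct(actuals, lower, upper):
    """Percentage of real observations inside the prediction interval.

    With an 80 percent confidence level, well-calibrated intervals
    should achieve roughly 80 percent coverage on average."""
    actuals, lower, upper = (np.asarray(a, dtype=float)
                             for a in (actuals, lower, upper))
    return 100.0 * np.mean((actuals >= lower) & (actuals <= upper))
```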

Use cases

Subscription use cases that illustrate some of the challenges we face are presented below. The use cases are for three different subscriptions and are not necessarily indicative of the characteristics of all time series. For each use case we compare the performance of the top three models. The plots illustrate that AUTS deals better with outliers, trend changes, and a lack of historical data. For example, the forecast use case at weekly granularity illustrates that AUTS outperforms the two other models even though less than a month of historical data (November 15, 2019 – December 6, 2019) was used to train the model.

Conclusions

In this article we shared the methodology of our univariate forecast model and the high-level architecture of AUTS. We also shared AUTS evaluation results for one of our own business forecasting scenarios, in which we provide significant assistance and insights to teams we work with across Microsoft.

We hope this article, and the series it is part of, helps you with your own business problems. Please leave a comment to share your forecasting scenarios and the techniques you are using today.

We would like to thank Casey Doyle for helping review the work.

References

[1]. https://otexts.com/fpp2/

[2]. Meyer D (2002). “Naive Time Series Forecasting Methods.” R News, 2(2), 7–10. URL http://CRAN.R-project.org/doc/Rnews/.

[3]. Hyndman RJ, Koehler AB, Snyder RD, Grose S (2002). “A State Space Framework for Automatic Forecasting Using Exponential Smoothing Methods.” International Journal of Forecasting, 18(3), 439–454.

[4]. Taylor SJ, Letham B (2018). “Forecasting at Scale.” The American Statistician, 72(1), 37–45.

[5]. Hyndman, R. J., & Khandakar, Y. (2008). Automatic time series forecasting: The forecast package for R. Journal of Statistical Software, 27(1), 1–22

[6]. Hochreiter S, Schmidhuber J (1997). “Long Short-Term Memory.” Neural Computation, 9(8), 1735–1780.

[7]. Harvey, A. & Peters, S. (1990), ‘Estimation procedures for structural time series models’, Journal of Forecasting 9, 89–108.

[8]. Tashman, L. J. & Leach, M. L. (1991), ‘Automatic forecasting software: a survey and evaluation’, International Journal of Forecasting 7, 209–230.

[9]. De Gooijer, J. G. & Hyndman, R. J. (2006), ’25 years of time series forecasting’, International Journal of Forecasting 22(3), 443–473.

[10]. Theil, H. (1950), “A rank-invariant method of linear and polynomial regression analysis. I, II, III”, Nederl. Akad. Wetensch., Proc., 53: 386–392, 521–525, 1397–1412.

[11]. Sen PK (1968). “Estimates of the Regression Coefficient Based on Kendall’s Tau.” Journal of the American Statistical Association, 63, 1379–1389.

[12]. https://docs.microsoft.com/en-us/azure/data-explorer/anomaly-detection

[13]. https://robjhyndman.com/hyndsight/narrow-pi/

[14]. Hyndman RJ, Koehler AB, Snyder RD, Grose S (2002). “A State Space Framework for Automatic Forecasting Using Exponential Smoothing Methods.” International Journal of Forecasting, 18(3), 439–454.
