Forecasting Time Series - Evaluation Metrics
Picking the right evaluation metric is one of the most important choices when using an AutoML framework. This page lists the forecast evaluation metrics available in AutoGluon, explains when different metrics should be used, and describes how to define custom evaluation metrics.
When using AutoGluon, you can specify the metric via the eval_metric argument to TimeSeriesPredictor, for example:

from autogluon.timeseries import TimeSeriesPredictor

predictor = TimeSeriesPredictor(eval_metric="MASE")
AutoGluon will use the provided metric to tune model hyperparameters, rank models, and construct the final ensemble for prediction.
Note

AutoGluon always reports all metrics in a higher-is-better format. For this purpose, some metrics are multiplied by -1. For example, if we set eval_metric="MASE", the predictor will actually report -MASE (i.e., the MASE score multiplied by -1). This means the test_score will be between 0 (most accurate forecast) and \(-\infty\) (least accurate forecast).
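For instance, in the following sketch (train_data and test_data stand in for your own TimeSeriesDataFrame splits, and the printed score is hypothetical):

from autogluon.timeseries import TimeSeriesPredictor

predictor = TimeSeriesPredictor(eval_metric="MASE").fit(train_data)
# evaluate() reports -MASE, so values closer to 0 indicate a more accurate forecast
print(predictor.evaluate(test_data))  # e.g. {'MASE': -0.95} (hypothetical value)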
Currently, AutoGluon supports the following evaluation metrics:

| Metric | Description |
|---|---|
| SQL | Scaled quantile loss |
| WQL | Weighted quantile loss |
| MAE | Mean absolute error |
| MAPE | Mean absolute percentage error |
| MASE | Mean absolute scaled error |
| MSE | Mean squared error |
| RMSE | Root mean squared error |
| RMSLE | Root mean squared logarithmic error |
| RMSSE | Root mean squared scaled error |
| SMAPE | Symmetric mean absolute percentage error |
| WAPE | Weighted absolute percentage error |
Alternatively, you can define a custom forecast evaluation metric.
Which evaluation metric to choose?
If you are not sure which evaluation metric to pick, here are three questions that can help you make the right choice for your use case.
1. Are you interested in a point forecast or a probabilistic forecast?
If your goal is to generate an accurate probabilistic forecast, you should use the WQL or SQL metrics. These metrics are based on the quantile loss and measure the accuracy of the quantile forecasts. By default, AutoGluon predicts quantile levels [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]. To predict a different set of quantiles, you can use the quantile_levels argument:

predictor = TimeSeriesPredictor(eval_metric="WQL", quantile_levels=[0.1, 0.5, 0.75, 0.9])
All remaining forecast metrics described on this page are point forecast metrics.
Note that if you set eval_metric to a point forecast metric when creating the TimeSeriesPredictor, then the forecast minimizing this metric will always be provided in the "mean" column of the predictions data frame.
2. Do you care more about accurately predicting time series with large values?
If the answer is “yes” (for example, if it’s important to more accurately predict sales of popular products), you should use scale-dependent metrics like WQL, MAE, RMSE, or WAPE. These metrics are also well-suited for dealing with sparse (intermittent) time series that have lots of zeros.

If the answer is “no” (you care equally about all time series in the dataset), consider scaled metrics like SQL, MASE, and RMSSE. Alternatively, the percentage-based metrics MAPE and SMAPE can also be used to equalize the scale across time series. However, these percentage-based metrics have some well-documented limitations, so we don’t recommend using them in practice. Note that both scaled and percentage-based metrics are poorly suited for sparse (intermittent) data.
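To see why this distinction matters, here is a schematic numpy illustration (not AutoGluon's exact implementation, which normalizes by the in-sample seasonal error): two series with identical relative accuracy but very different scales contribute very differently to a scale-dependent metric, while a scaled metric treats them equally.

import numpy as np

# Two series with the same *relative* accuracy but very different scales
y_small, f_small = np.array([10.0, 12.0]), np.array([11.0, 11.0])
y_large, f_large = np.array([1000.0, 1200.0]), np.array([1100.0, 1100.0])

# Scale-dependent MAE: the average is dominated by the large-valued series
mae = 0.5 * (np.abs(y_small - f_small).mean() + np.abs(y_large - f_large).mean())
print(mae)  # 50.5 -- almost entirely driven by the large series

# Normalizing each series' error by its own scale treats both series equally
scaled = 0.5 * (
    np.abs(y_small - f_small).mean() / np.abs(y_small).mean()
    + np.abs(y_large - f_large).mean() / np.abs(y_large).mean()
)
print(scaled)  # ~0.0909 -- both series contribute the same amount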
3. (Point forecast only) Do you want to estimate the mean or the median?
To estimate the median, you need to use metrics such as MAE, MASE, or WAPE. If your goal is to predict the mean (expected value), you should use the MSE, RMSE, or RMSSE metrics.
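The following numpy sketch illustrates this rule of thumb: among all constant forecasts, the median of the data minimizes MAE, while the mean minimizes MSE.

import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # skewed data: median = 3, mean = 22
candidates = np.linspace(0.0, 100.0, 10001)

# The constant forecast with the lowest MAE coincides with the median ...
best_mae = candidates[np.argmin([np.abs(y - c).mean() for c in candidates])]
# ... while the one with the lowest MSE coincides with the mean
best_mse = candidates[np.argmin([np.square(y - c).mean() for c in candidates])]
print(best_mae, np.median(y))  # 3.0 3.0
print(best_mse, y.mean())      # 22.0 22.0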
Point forecast metrics
We use the following notation in mathematical definitions of point forecast metrics:
\(y_{i,t}\) - observed value of time series \(i\) at time \(t\)
\(f_{i,t}\) - predicted value of time series \(i\) at time \(t\)
\(N\) - number of time series (number of items) in the dataset
\(T\) - length of the observed time series
\(H\) - length of the forecast horizon (prediction_length)
class autogluon.timeseries.metrics.MAE

Mean absolute error.

\[\operatorname{MAE} = \frac{1}{N} \frac{1}{H} \sum_{i=1}^{N}\sum_{t=T+1}^{T+H} |y_{i,t} - f_{i,t}|\]

Properties:

- scale-dependent (time series with large absolute value contribute more to the loss)
- not sensitive to outliers
- prefers models that accurately estimate the median
class autogluon.timeseries.metrics.MAPE

Mean absolute percentage error.

\[\operatorname{MAPE} = \frac{1}{N} \frac{1}{H} \sum_{i=1}^{N} \sum_{t=T+1}^{T+H} \frac{ |y_{i,t} - f_{i,t}|}{|y_{i,t}|}\]

Properties:

- should only be used if all time series have positive values
- undefined for time series that contain zero values
- penalizes overprediction more heavily than underprediction
class autogluon.timeseries.metrics.MASE

Mean absolute scaled error. Normalizes the absolute error for each time series by the historic seasonal error of this time series.

\[\operatorname{MASE} = \frac{1}{N} \frac{1}{H} \sum_{i=1}^{N} \frac{1}{a_i} \sum_{t=T+1}^{T+H} |y_{i,t} - f_{i,t}|\]

where \(a_i\) is the historic absolute seasonal error defined as

\[a_i = \frac{1}{T-m} \sum_{t=m+1}^T |y_{i,t} - y_{i,t-m}|\]

and \(m\) is the seasonal period of the time series (eval_metric_seasonal_period).

Properties:

- scaled metric (normalizes the error for each time series by the scale of that time series)
- undefined for constant time series
- not sensitive to outliers
- prefers models that accurately estimate the median
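To make the definition concrete, here is a minimal numpy sketch of the MASE computation for a single series with seasonal period m=1 (toy numbers, not AutoGluon's implementation):

import numpy as np

y_past = np.array([3.0, 5.0, 4.0, 6.0])  # observed history (length T)
y_future = np.array([7.0, 8.0])          # actual values over the horizon (length H)
f_future = np.array([6.0, 9.0])          # point forecast

m = 1  # seasonal period (eval_metric_seasonal_period)
a = np.abs(y_past[m:] - y_past[:-m]).mean()  # historic seasonal error a_i = 5/3
mase = np.abs(y_future - f_future).mean() / a
print(mase)  # 1.0 / (5/3) = 0.6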
class autogluon.timeseries.metrics.MSE

Mean squared error. Using this metric will lead to a forecast of the mean.

\[\operatorname{MSE} = \frac{1}{N} \frac{1}{H} \sum_{i=1}^{N}\sum_{t=T+1}^{T+H} (y_{i,t} - f_{i,t})^2\]

Properties:

- scale-dependent (time series with large absolute value contribute more to the loss)
- heavily penalizes models that cannot quickly adapt to abrupt changes in the time series
- sensitive to outliers
- prefers models that accurately estimate the mean (expected value)
class autogluon.timeseries.metrics.RMSE

Root mean squared error.

\[\operatorname{RMSE} = \sqrt{\frac{1}{N} \frac{1}{H} \sum_{i=1}^{N}\sum_{t=T+1}^{T+H} (y_{i,t} - f_{i,t})^2}\]

Properties:

- scale-dependent (time series with large absolute value contribute more to the loss)
- heavily penalizes models that cannot quickly adapt to abrupt changes in the time series
- sensitive to outliers
- prefers models that accurately estimate the mean (expected value)
class autogluon.timeseries.metrics.RMSLE

Root mean squared logarithmic error. Applies a logarithmic transformation to the predictions before computing the root mean squared error. Assumes both the ground truth and predictions are positive. If negative predictions are given, they will be clipped to zero.

\[\operatorname{RMSLE} = \sqrt{\frac{1}{N} \frac{1}{H} \sum_{i=1}^{N} \sum_{t=T+1}^{T+H} (\ln(1 + y_{i,t}) - \ln(1 + f_{i,t}))^2}\]

Properties:

- undefined for time series with negative values
- penalizes models that underpredict more than models that overpredict
- insensitive to the effects of outliers and scale; best when targets can vary or trend exponentially
class autogluon.timeseries.metrics.RMSSE

Root mean squared scaled error. Normalizes the squared error for each time series by the historic squared seasonal error of this time series.

\[\operatorname{RMSSE} = \sqrt{\frac{1}{N} \frac{1}{H} \sum_{i=1}^{N} \frac{1}{s_i} \sum_{t=T+1}^{T+H} (y_{i,t} - f_{i,t})^2}\]

where \(s_i\) is the historic squared seasonal error defined as

\[s_i = \frac{1}{T-m} \sum_{t=m+1}^T (y_{i,t} - y_{i,t-m})^2\]

and \(m\) is the seasonal period of the time series (eval_metric_seasonal_period).

Properties:

- scaled metric (normalizes the error for each time series by the scale of that time series)
- undefined for constant time series
- heavily penalizes models that cannot quickly adapt to abrupt changes in the time series
- sensitive to outliers
- prefers models that accurately estimate the mean (expected value)
class autogluon.timeseries.metrics.SMAPE

Symmetric mean absolute percentage error.

\[\operatorname{SMAPE} = 2 \frac{1}{N} \frac{1}{H} \sum_{i=1}^{N} \sum_{t=T+1}^{T+H} \frac{ |y_{i,t} - f_{i,t}|}{|y_{i,t}| + |f_{i,t}|}\]

Properties:

- should only be used if all time series have positive values
- poorly suited for sparse & intermittent time series that contain zero values
- penalizes underprediction more heavily than overprediction (the forecast appears in the denominator, so overpredictions incur a smaller relative error)
class autogluon.timeseries.metrics.WAPE

Weighted absolute percentage error. Defined as the sum of absolute errors divided by the sum of absolute time series values in the forecast horizon.

\[\operatorname{WAPE} = \frac{1}{\sum_{i=1}^{N} \sum_{t=T+1}^{T+H} |y_{i, t}|} \sum_{i=1}^{N} \sum_{t=T+1}^{T+H} |y_{i,t} - f_{i,t}|\]

Properties:

- scale-dependent (time series with large absolute value contribute more to the loss)
- not sensitive to outliers
- prefers models that accurately estimate the median
Probabilistic forecast metrics

In addition to the notation listed above, we use the following notation to define probabilistic forecast metrics:

\(f_{i,t}^q\) - predicted quantile \(q\) of time series \(i\) at time \(t\)

\(\rho_q(y, f)\) - quantile loss at level \(q\), defined as

\[\rho_q(y, f) = \begin{cases} 2 \cdot (1 - q) \cdot (f - y), & \text{ if } y < f \\ 2 \cdot q \cdot (y - f), & \text{ if } y \geq f \end{cases}\]
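As a concrete illustration, here is a minimal numpy sketch of this quantile (pinball) loss, including the factor of 2 from the definition above:

import numpy as np

def quantile_loss(y, f_q, q):
    # rho_q(y, f): 2*q*(y - f) when the forecast is below the actual value,
    # 2*(1 - q)*(f - y) when it is above
    return 2.0 * np.where(y >= f_q, q * (y - f_q), (1.0 - q) * (f_q - y))

y = np.array([10.0, 10.0])
print(quantile_loss(y, np.array([8.0, 12.0]), q=0.9))
# [3.6 0.4] -- underpredicting the 0.9 quantile costs 9x more than overpredicting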
class autogluon.timeseries.metrics.SQL

Scaled quantile loss. Also known as scaled pinball loss. Normalizes the quantile loss for each time series by the historic seasonal error of this time series.

\[\operatorname{SQL} = \frac{1}{N} \frac{1}{H} \sum_{i=1}^{N} \frac{1}{a_i} \sum_{t=T+1}^{T+H} \sum_{q} \rho_q(y_{i,t}, f^q_{i,t})\]

where \(a_i\) is the historic absolute seasonal error defined as

\[a_i = \frac{1}{T-m} \sum_{t=m+1}^T |y_{i,t} - y_{i,t-m}|\]

and \(m\) is the seasonal period of the time series (eval_metric_seasonal_period).

Properties:

- scaled metric (normalizes the error for each time series by the scale of that time series)
- undefined for constant time series
- equivalent to MASE if quantile_levels = [0.5]
class autogluon.timeseries.metrics.WQL

Weighted quantile loss. Also known as weighted pinball loss. Defined as the total quantile loss divided by the sum of absolute time series values in the forecast horizon.

\[\operatorname{WQL} = \frac{1}{\sum_{i=1}^{N} \sum_{t=T+1}^{T+H} |y_{i, t}|} \sum_{i=1}^{N} \sum_{t=T+1}^{T+H} \sum_{q} \rho_q(y_{i,t}, f^q_{i,t})\]

Properties:

- scale-dependent (time series with large absolute value contribute more to the loss)
- equivalent to WAPE if quantile_levels = [0.5]
Custom forecast metrics
If none of the built-in metrics meet your requirements, you can provide a custom evaluation metric to AutoGluon.
To define a custom metric, you need to create a class that inherits from TimeSeriesScorer and implements the compute_metric method according to the following API specification:
TimeSeriesScorer.compute_metric(data_future: TimeSeriesDataFrame, predictions: TimeSeriesDataFrame, target: str = "target", **kwargs) -> float

Internal method that computes the metric for given forecast & actual data. This method should be implemented by all custom metrics.

Parameters:

- data_future (TimeSeriesDataFrame) - Actual values of the time series during the forecast horizon (prediction_length values for each time series in the dataset). Must have the same index as predictions.
- predictions (TimeSeriesDataFrame) - Data frame with predictions for the forecast horizon. Contains the column "mean" (point forecast) and the columns corresponding to each of the quantile levels. Must have the same index as data_future.
- target (str, default = "target") - Name of the column in data_future that contains the target time series.

Returns:

- score (float) - Value of the metric for the given forecast and data. If self.greater_is_better_internal is True, the score is returned in greater-is-better format, otherwise in lower-is-better format.
Custom mean squared error metric

Here is an example of how you can define a custom mean squared error (MSE) metric using TimeSeriesScorer.
import sklearn.metrics

from autogluon.timeseries.metrics import TimeSeriesScorer


class MeanSquaredError(TimeSeriesScorer):
    greater_is_better_internal = False
    optimum = 0.0

    def compute_metric(self, data_future, predictions, target, **kwargs):
        return sklearn.metrics.mean_squared_error(y_true=data_future[target], y_pred=predictions["mean"])
The internal method compute_metric returns the metric in lower-is-better format, so we need to set greater_is_better_internal=False. This will tell AutoGluon that the metric value must be multiplied by -1 to convert it to greater-is-better format.
Note

Custom metrics must be defined in a separate Python file and imported so that they can be pickled (Python’s serialization protocol). If a custom metric is not picklable, AutoGluon may crash during fit if you enable hyperparameter tuning. In the above example, you would want to create a new Python file such as my_metrics.py with class MeanSquaredError defined in it, and then use it via from my_metrics import MeanSquaredError.
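For example, the project layout could look like this (the training script name is illustrative):

# my_metrics.py
import sklearn.metrics

from autogluon.timeseries.metrics import TimeSeriesScorer


class MeanSquaredError(TimeSeriesScorer):
    greater_is_better_internal = False
    optimum = 0.0

    def compute_metric(self, data_future, predictions, target, **kwargs):
        return sklearn.metrics.mean_squared_error(y_true=data_future[target], y_pred=predictions["mean"])

# train.py
from my_metrics import MeanSquaredError

from autogluon.timeseries import TimeSeriesPredictor

predictor = TimeSeriesPredictor(eval_metric=MeanSquaredError())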
We can use the custom metric to measure the accuracy of a forecast generated by the predictor.
import pandas as pd

from autogluon.timeseries import TimeSeriesPredictor, TimeSeriesDataFrame

# Create dummy dataset
data = TimeSeriesDataFrame.from_iterable_dataset(
    [
        {"start": pd.Period("2023-01-01", freq="D"), "target": list(range(15))},
        {"start": pd.Period("2023-01-01", freq="D"), "target": list(range(30, 45))},
    ]
)

prediction_length = 3
train_data, test_data = data.train_test_split(prediction_length=prediction_length)
predictor = TimeSeriesPredictor(prediction_length=prediction_length, verbosity=0).fit(train_data, hyperparameters={"Naive": {}})
predictions = predictor.predict(train_data)
mse = MeanSquaredError()
mse_score = mse(
    data=test_data,
    predictions=predictions,
    prediction_length=predictor.prediction_length,
    target=predictor.target,
)
print(f"{mse.name_with_sign} = {mse_score}")
-MeanSquaredError = -4.666666666666667
Note that the metric value has been multiplied by -1 because we set greater_is_better_internal=False when defining the metric.
When we call the metric, TimeSeriesScorer takes care of splitting test_data into past & future parts, validating that predictions have correct timestamps, and ensuring that the score is reported in greater-is-better format.

During the metric call, the method compute_metric that we implemented receives as input the following arguments:
Test data corresponding to the forecast horizon:

data_future = test_data.slice_by_timestep(-prediction_length, None)
data_future

| item_id | timestamp | target |
|---|---|---|
| 0 | 2023-01-13 | 12 |
| 0 | 2023-01-14 | 13 |
| 0 | 2023-01-15 | 14 |
| 1 | 2023-01-13 | 42 |
| 1 | 2023-01-14 | 43 |
| 1 | 2023-01-15 | 44 |
Predictions for the forecast horizon:

predictions.round(2)

| item_id | timestamp | mean | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-01-13 | 11.0 | 9.72 | 10.16 | 10.48 | 10.75 | 11.0 | 11.25 | 11.52 | 11.84 | 12.28 |
| 0 | 2023-01-14 | 11.0 | 9.19 | 9.81 | 10.26 | 10.64 | 11.0 | 11.36 | 11.74 | 12.19 | 12.81 |
| 0 | 2023-01-15 | 11.0 | 8.78 | 9.54 | 10.09 | 10.56 | 11.0 | 11.44 | 11.91 | 12.46 | 13.22 |
| 1 | 2023-01-13 | 41.0 | 39.72 | 40.16 | 40.48 | 40.75 | 41.0 | 41.25 | 41.52 | 41.84 | 42.28 |
| 1 | 2023-01-14 | 41.0 | 39.19 | 39.81 | 40.26 | 40.64 | 41.0 | 41.36 | 41.74 | 42.19 | 42.81 |
| 1 | 2023-01-15 | 41.0 | 38.78 | 39.54 | 40.09 | 40.56 | 41.0 | 41.44 | 41.91 | 42.46 | 43.22 |
Note that both data_future and predictions cover the same time range.
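We can verify the reported score by hand. The Naive model repeats the last observed training value (11 and 41), so the errors over the 3-step horizon are 1, 2, and 3 for each of the two series:

# Hand computation of the MSE reported above (up to the sign flip)
errors = [1, 2, 3, 1, 2, 3]
print(sum(e**2 for e in errors) / len(errors))  # 4.666... matches the score above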
Custom quantile loss metric

The metric can be computed on any columns of the predictions DataFrame. For example, here is how we can define a mean quantile loss metric that measures the accuracy of the quantile forecast.
class MeanQuantileLoss(TimeSeriesScorer):
    needs_quantile = True
    greater_is_better_internal = False
    optimum = 0.0

    def compute_metric(self, data_future, predictions, target, **kwargs):
        quantile_columns = [col for col in predictions if col != "mean"]
        total_quantile_loss = 0.0
        for q in quantile_columns:
            total_quantile_loss += sklearn.metrics.mean_pinball_loss(y_true=data_future[target], y_pred=predictions[q], alpha=float(q))
        return total_quantile_loss / len(quantile_columns)
Here we set needs_quantile=True to tell AutoGluon that this metric is evaluated on the quantile forecasts. In this case, models such as DirectTabularModel will train a TabularPredictor with problem_type="quantile" under the hood. If needs_quantile=False, these models would use problem_type="regression" instead.
Custom mean absolute scaled error metric

Finally, here is how we can define the mean absolute scaled error (MASE) metric. Unlike the previously discussed metrics, MASE is computed using both past and future time series values. The past values are used to compute the scale by which we normalize the error during the forecast horizon.
class MeanAbsoluteScaledError(TimeSeriesScorer):
    greater_is_better_internal = False
    optimum = 0.0
    optimized_by_median = True
    equivalent_tabular_regression_metric = "mean_absolute_error"

    def save_past_metrics(
        self, data_past: TimeSeriesDataFrame, target: str = "target", seasonal_period: int = 1, **kwargs
    ) -> None:
        seasonal_diffs = data_past[target].groupby(level="item_id").diff(seasonal_period).abs()
        self._abs_seasonal_error_per_item = seasonal_diffs.groupby(level="item_id").mean().fillna(1.0)

    def clear_past_metrics(self):
        self._abs_seasonal_error_per_item = None

    def compute_metric(
        self, data_future: TimeSeriesDataFrame, predictions: TimeSeriesDataFrame, target: str = "target", **kwargs
    ) -> float:
        mae_per_item = (data_future[target] - predictions["mean"]).abs().groupby(level="item_id").mean()
        return (mae_per_item / self._abs_seasonal_error_per_item).mean()
We compute the metrics on past data using the save_past_metrics method. Doing this in a separate method allows AutoGluon to avoid redundant computations when fitting the weighted ensemble, which requires thousands of metric evaluations.

Because we set optimized_by_median=True, AutoGluon will automatically paste the median forecast into the "mean" column of predictions. This is done for consistency: if TimeSeriesPredictor is trained with a point forecast metric, the optimal point forecast will always be stored in the "mean" column.

Finally, the equivalent_tabular_regression_metric is used by forecasting models that fit TabularPredictor under the hood.
Using custom metrics in TimeSeriesPredictor
Now that we have created several custom metrics, let’s use them for training and evaluating models.
predictor = TimeSeriesPredictor(eval_metric=MeanQuantileLoss()).fit(train_data, hyperparameters={"Naive": {}, "SeasonalNaive": {}, "Theta": {}})
Beginning AutoGluon training...
AutoGluon will save models to 'AutogluonModels/ag-20241030_200149'
=================== System Info ===================
AutoGluon Version: 1.1.1b20241030
Python Version: 3.10.13
Operating System: Linux
Platform Machine: x86_64
Platform Version: #1 SMP Tue Sep 24 10:00:37 UTC 2024
CPU Count: 8
GPU Count: 1
Memory Avail: 28.23 GB / 30.95 GB (91.2%)
Disk Space Avail: 214.96 GB / 255.99 GB (84.0%)
===================================================
Fitting with arguments:
{'enable_ensemble': True,
'eval_metric': MeanQuantileLoss,
'hyperparameters': {'Naive': {}, 'SeasonalNaive': {}, 'Theta': {}},
'known_covariates_names': [],
'num_val_windows': 1,
'prediction_length': 1,
'quantile_levels': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
'random_seed': 123,
'refit_every_n_windows': 1,
'refit_full': False,
'skip_model_selection': False,
'target': 'target',
'verbosity': 2}
Inferred time series frequency: 'D'
Provided train_data has 24 rows, 2 time series. Median time series length is 12 (min=12, max=12).
Provided data contains following columns:
target: 'target'
AutoGluon will gauge predictive performance using evaluation metric: 'MeanQuantileLoss'
This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
===================================================
Starting training. Start time is 2024-10-30 20:01:49
Models that will be trained: ['Naive', 'SeasonalNaive', 'Theta']
Training timeseries model Naive.
-0.3323 = Validation score (-MeanQuantileLoss)
0.02 s = Training runtime
0.02 s = Validation (prediction) runtime
Training timeseries model SeasonalNaive.
-2.3263 = Validation score (-MeanQuantileLoss)
0.02 s = Training runtime
0.02 s = Validation (prediction) runtime
Training timeseries model Theta.
-0.2525 = Validation score (-MeanQuantileLoss)
0.02 s = Training runtime
24.64 s = Validation (prediction) runtime
Fitting simple weighted ensemble.
Ensemble weights: {'Theta': 1.0}
-0.2525 = Validation score (-MeanQuantileLoss)
2.35 s = Training runtime
24.64 s = Validation (prediction) runtime
Training complete. Models trained: ['Naive', 'SeasonalNaive', 'Theta', 'WeightedEnsemble']
Total runtime: 27.12 s
Best model: Theta
Best model score: -0.2525
We can also evaluate a trained predictor using these custom metrics:
predictor.evaluate(test_data, metrics=[MeanAbsoluteScaledError(), MeanQuantileLoss(), MeanSquaredError()])
Model not specified in predict, will default to the model with the best validation score: Theta
{'MeanAbsoluteScaledError': -0.07215009416852679,
'MeanQuantileLoss': -0.25252532958984375,
'MeanSquaredError': -0.25507616833783686}
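To compare all trained models on the test data under the predictor’s eval_metric, you can also use the leaderboard method (output omitted here):

predictor.leaderboard(test_data)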
That’s all it takes to create and use custom forecasting metrics in AutoGluon!
You can have a look at the AutoGluon source code for example implementations of point and quantile forecasting metrics.
If you create a custom metric, consider submitting a PR so that we can officially add it to AutoGluon.
For more tutorials, refer to Forecasting Time Series - Quick Start and Forecasting Time Series - In Depth.