Forecasting Time Series - Evaluation Metrics
Picking the right evaluation metric is one of the most important choices when using an AutoML framework. This page lists the forecast evaluation metrics available in AutoGluon, explains when different metrics should be used, and describes how to define custom evaluation metrics.
When using AutoGluon, you can specify the metric via the eval_metric argument to TimeSeriesPredictor, for example:

from autogluon.timeseries import TimeSeriesPredictor

predictor = TimeSeriesPredictor(eval_metric="MASE")
AutoGluon will use the provided metric to tune model hyperparameters, rank models, and construct the final ensemble for prediction.
Note

AutoGluon always reports all metrics in a higher-is-better format. For this purpose, some metrics are multiplied by -1. For example, if we set eval_metric="MASE", the predictor will actually report -MASE (i.e., the MASE score multiplied by -1). This means the test_score will be between 0 (most accurate forecast) and \(-\infty\) (least accurate forecast).
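For instance, in the following sketch (train_data and test_data stand in for your own TimeSeriesDataFrame splits, and the printed score is hypothetical):

from autogluon.timeseries import TimeSeriesPredictor

predictor = TimeSeriesPredictor(eval_metric="MASE").fit(train_data)
# evaluate() reports -MASE, so values closer to 0 indicate a more accurate forecast
print(predictor.evaluate(test_data))  # e.g. {'MASE': -0.95} (hypothetical value)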
Currently, AutoGluon supports the following evaluation metrics:

| Metric | Description |
|---|---|
| SQL | Scaled quantile loss |
| WQL | Weighted quantile loss |
| MAE | Mean absolute error |
| MAPE | Mean absolute percentage error |
| MASE | Mean absolute scaled error |
| MSE | Mean squared error |
| RMSE | Root mean squared error |
| RMSLE | Root mean squared logarithmic error |
| RMSSE | Root mean squared scaled error |
| SMAPE | Symmetric mean absolute percentage error |
| WAPE | Weighted absolute percentage error |
Alternatively, you can define a custom forecast evaluation metric.
Which evaluation metric to choose?
If you are not sure which evaluation metric to pick, here are three questions that can help you make the right choice for your use case.
1. Are you interested in a point forecast or a probabilistic forecast?
If your goal is to generate an accurate probabilistic forecast, you should use the WQL or SQL metrics. These metrics are based on the quantile loss and measure the accuracy of the quantile forecasts. By default, AutoGluon predicts quantile levels [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]. To predict a different set of quantiles, you can use the quantile_levels argument:

predictor = TimeSeriesPredictor(eval_metric="WQL", quantile_levels=[0.1, 0.5, 0.75, 0.9])
All remaining forecast metrics described on this page are point forecast metrics.
Note that if you set eval_metric to a point forecast metric when creating the TimeSeriesPredictor, then the forecast minimizing this metric will always be provided in the "mean" column of the predictions data frame.
2. Do you care more about accurately predicting time series with large values?
If the answer is “yes” (for example, if it’s important to more accurately predict sales of popular products), you should use scale-dependent metrics like WQL, MAE, RMSE, or WAPE. These metrics are also well-suited for dealing with sparse (intermittent) time series that have lots of zeros.

If the answer is “no” (you care equally about all time series in the dataset), consider scaled metrics like SQL, MASE, and RMSSE. Alternatively, the percentage-based metrics MAPE and SMAPE can also be used to equalize the scale across time series. However, these percentage-based metrics have some well-documented limitations, so we don’t recommend using them in practice. Note that both scaled and percentage-based metrics are poorly suited for sparse (intermittent) data.
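To see why this distinction matters, here is a schematic numpy illustration (not AutoGluon's exact implementation, which normalizes by the in-sample seasonal error): two series with identical relative accuracy but very different scales contribute very differently to a scale-dependent metric, while a scaled metric treats them equally.

import numpy as np

# Two series with the same *relative* accuracy but very different scales
y_small, f_small = np.array([10.0, 12.0]), np.array([11.0, 11.0])
y_large, f_large = np.array([1000.0, 1200.0]), np.array([1100.0, 1100.0])

# Scale-dependent MAE: the average is dominated by the large-valued series
mae = 0.5 * (np.abs(y_small - f_small).mean() + np.abs(y_large - f_large).mean())
print(mae)  # 50.5 -- almost entirely driven by the large series

# Normalizing each series' error by its own scale treats both series equally
scaled = 0.5 * (
    np.abs(y_small - f_small).mean() / np.abs(y_small).mean()
    + np.abs(y_large - f_large).mean() / np.abs(y_large).mean()
)
print(scaled)  # ~0.0909 -- both series contribute the same amount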
3. (Point forecast only) Do you want to estimate the mean or the median?
To estimate the median, you need to use metrics such as MAE, MASE, or WAPE. If your goal is to predict the mean (expected value), you should use the MSE, RMSE, or RMSSE metrics.
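The following numpy sketch illustrates this rule of thumb: among all constant forecasts, the median of the data minimizes MAE, while the mean minimizes MSE.

import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # skewed data: median = 3, mean = 22
candidates = np.linspace(0.0, 100.0, 10001)

# The constant forecast with the lowest MAE coincides with the median ...
best_mae = candidates[np.argmin([np.abs(y - c).mean() for c in candidates])]
# ... while the one with the lowest MSE coincides with the mean
best_mse = candidates[np.argmin([np.square(y - c).mean() for c in candidates])]
print(best_mae, np.median(y))  # 3.0 3.0
print(best_mse, y.mean())      # 22.0 22.0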
Point forecast metrics
We use the following notation in mathematical definitions of point forecast metrics:
\(y_{i,t}\) - observed value of time series \(i\) at time \(t\)
\(f_{i,t}\) - predicted value of time series \(i\) at time \(t\)
\(N\) - number of time series (number of items) in the dataset
\(T\) - length of the observed time series
\(H\) - length of the forecast horizon (prediction_length)
class autogluon.timeseries.metrics.MAE

Mean absolute error.

\[\operatorname{MAE} = \frac{1}{N} \frac{1}{H} \sum_{i=1}^{N}\sum_{t=T+1}^{T+H} |y_{i,t} - f_{i,t}|\]

Properties:

- scale-dependent (time series with large absolute value contribute more to the loss)
- not sensitive to outliers
- prefers models that accurately estimate the median
class autogluon.timeseries.metrics.MAPE

Mean absolute percentage error.

\[\operatorname{MAPE} = \frac{1}{N} \frac{1}{H} \sum_{i=1}^{N} \sum_{t=T+1}^{T+H} \frac{ |y_{i,t} - f_{i,t}|}{|y_{i,t}|}\]

Properties:

- should only be used if all time series have positive values
- undefined for time series that contain zero values
- penalizes overprediction more heavily than underprediction
class autogluon.timeseries.metrics.MASE

Mean absolute scaled error. Normalizes the absolute error for each time series by the historic seasonal error of this time series.

\[\operatorname{MASE} = \frac{1}{N} \frac{1}{H} \sum_{i=1}^{N} \frac{1}{a_i} \sum_{t=T+1}^{T+H} |y_{i,t} - f_{i,t}|\]

where \(a_i\) is the historic absolute seasonal error defined as

\[a_i = \frac{1}{T-m} \sum_{t=m+1}^T |y_{i,t} - y_{i,t-m}|\]

and \(m\) is the seasonal period of the time series (eval_metric_seasonal_period).

Properties:

- scaled metric (normalizes the error for each time series by the scale of that time series)
- undefined for constant time series
- not sensitive to outliers
- prefers models that accurately estimate the median
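To make the definition concrete, here is a minimal numpy sketch of the MASE computation for a single series with seasonal period m=1 (toy numbers, not AutoGluon's implementation):

import numpy as np

y_past = np.array([3.0, 5.0, 4.0, 6.0])  # observed history (length T)
y_future = np.array([7.0, 8.0])          # actual values over the horizon (length H)
f_future = np.array([6.0, 9.0])          # point forecast

m = 1  # seasonal period (eval_metric_seasonal_period)
a = np.abs(y_past[m:] - y_past[:-m]).mean()  # historic seasonal error a_i = 5/3
mase = np.abs(y_future - f_future).mean() / a
print(mase)  # 1.0 / (5/3) = 0.6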
class autogluon.timeseries.metrics.MSE

Mean squared error. Using this metric will lead to a forecast of the mean.

\[\operatorname{MSE} = \frac{1}{N} \frac{1}{H} \sum_{i=1}^{N}\sum_{t=T+1}^{T+H} (y_{i,t} - f_{i,t})^2\]

Properties:

- scale-dependent (time series with large absolute value contribute more to the loss)
- heavily penalizes models that cannot quickly adapt to abrupt changes in the time series
- sensitive to outliers
- prefers models that accurately estimate the mean (expected value)
class autogluon.timeseries.metrics.RMSE

Root mean squared error.

\[\operatorname{RMSE} = \sqrt{\frac{1}{N} \frac{1}{H} \sum_{i=1}^{N}\sum_{t=T+1}^{T+H} (y_{i,t} - f_{i,t})^2}\]

Properties:

- scale-dependent (time series with large absolute value contribute more to the loss)
- heavily penalizes models that cannot quickly adapt to abrupt changes in the time series
- sensitive to outliers
- prefers models that accurately estimate the mean (expected value)
class autogluon.timeseries.metrics.RMSLE

Root mean squared logarithmic error. Applies a logarithmic transformation to the predictions before computing the root mean squared error. Assumes both the ground truth and predictions are positive. If negative predictions are given, they will be clipped to zero.

\[\operatorname{RMSLE} = \sqrt{\frac{1}{N} \frac{1}{H} \sum_{i=1}^{N} \sum_{t=T+1}^{T+H} (\ln(1 + y_{i,t}) - \ln(1 + f_{i,t}))^2}\]

Properties:

- undefined for time series with negative values
- penalizes models that underpredict more than models that overpredict
- insensitive to the effects of outliers and scale; best when targets can vary or trend exponentially
class autogluon.timeseries.metrics.RMSSE

Root mean squared scaled error. Normalizes the squared error for each time series by the historic squared seasonal error of this time series.

\[\operatorname{RMSSE} = \sqrt{\frac{1}{N} \frac{1}{H} \sum_{i=1}^{N} \frac{1}{s_i} \sum_{t=T+1}^{T+H} (y_{i,t} - f_{i,t})^2}\]

where \(s_i\) is the historic squared seasonal error defined as

\[s_i = \frac{1}{T-m} \sum_{t=m+1}^T (y_{i,t} - y_{i,t-m})^2\]

and \(m\) is the seasonal period of the time series (eval_metric_seasonal_period).

Properties:

- scaled metric (normalizes the error for each time series by the scale of that time series)
- undefined for constant time series
- heavily penalizes models that cannot quickly adapt to abrupt changes in the time series
- sensitive to outliers
- prefers models that accurately estimate the mean (expected value)
class autogluon.timeseries.metrics.SMAPE

Symmetric mean absolute percentage error.

\[\operatorname{SMAPE} = 2 \frac{1}{N} \frac{1}{H} \sum_{i=1}^{N} \sum_{t=T+1}^{T+H} \frac{ |y_{i,t} - f_{i,t}|}{|y_{i,t}| + |f_{i,t}|}\]

Properties:

- should only be used if all time series have positive values
- poorly suited for sparse & intermittent time series that contain zero values
- penalizes underprediction more heavily than overprediction (the forecast appears in the denominator, so overpredictions incur a smaller relative error)
class autogluon.timeseries.metrics.WAPE

Weighted absolute percentage error. Defined as the sum of absolute errors divided by the sum of absolute time series values in the forecast horizon.

\[\operatorname{WAPE} = \frac{1}{\sum_{i=1}^{N} \sum_{t=T+1}^{T+H} |y_{i, t}|} \sum_{i=1}^{N} \sum_{t=T+1}^{T+H} |y_{i,t} - f_{i,t}|\]

Properties:

- scale-dependent (time series with large absolute value contribute more to the loss)
- not sensitive to outliers
- prefers models that accurately estimate the median
Probabilistic forecast metrics

In addition to the notation listed above, we use the following notation to define probabilistic forecast metrics:

\(f_{i,t}^q\) - predicted quantile \(q\) of time series \(i\) at time \(t\)

\(\rho_q(y, f)\) - quantile loss at level \(q\), defined as

\[\rho_q(y, f) = \begin{cases} 2 \cdot (1 - q) \cdot (f - y), & \text{ if } y < f \\ 2 \cdot q \cdot (y - f), & \text{ if } y \geq f \end{cases}\]
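As a concrete illustration, here is a minimal numpy sketch of this quantile (pinball) loss, including the factor of 2 from the definition above:

import numpy as np

def quantile_loss(y, f_q, q):
    # rho_q(y, f): 2*q*(y - f) when the forecast is below the actual value,
    # 2*(1 - q)*(f - y) when it is above
    return 2.0 * np.where(y >= f_q, q * (y - f_q), (1.0 - q) * (f_q - y))

y = np.array([10.0, 10.0])
print(quantile_loss(y, np.array([8.0, 12.0]), q=0.9))
# [3.6 0.4] -- underpredicting the 0.9 quantile costs 9x more than overpredicting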
class autogluon.timeseries.metrics.SQL

Scaled quantile loss. Also known as scaled pinball loss. Normalizes the quantile loss for each time series by the historic seasonal error of this time series.

\[\operatorname{SQL} = \frac{1}{N} \frac{1}{H} \sum_{i=1}^{N} \frac{1}{a_i} \sum_{t=T+1}^{T+H} \sum_{q} \rho_q(y_{i,t}, f^q_{i,t})\]

where \(a_i\) is the historic absolute seasonal error defined as

\[a_i = \frac{1}{T-m} \sum_{t=m+1}^T |y_{i,t} - y_{i,t-m}|\]

and \(m\) is the seasonal period of the time series (eval_metric_seasonal_period).

Properties:

- scaled metric (normalizes the error for each time series by the scale of that time series)
- undefined for constant time series
- equivalent to MASE if quantile_levels = [0.5]
class autogluon.timeseries.metrics.WQL

Weighted quantile loss. Also known as weighted pinball loss. Defined as the total quantile loss divided by the sum of absolute time series values in the forecast horizon.

\[\operatorname{WQL} = \frac{1}{\sum_{i=1}^{N} \sum_{t=T+1}^{T+H} |y_{i, t}|} \sum_{i=1}^{N} \sum_{t=T+1}^{T+H} \sum_{q} \rho_q(y_{i,t}, f^q_{i,t})\]

Properties:

- scale-dependent (time series with large absolute value contribute more to the loss)
- equivalent to WAPE if quantile_levels = [0.5]
Custom forecast metrics
If none of the built-in metrics meet your requirements, you can provide a custom evaluation metric to AutoGluon.
To define a custom metric, you need to create a class that inherits from TimeSeriesScorer and implements the compute_metric method according to the following API specification:
TimeSeriesScorer.compute_metric(data_future: TimeSeriesDataFrame, predictions: TimeSeriesDataFrame, target: str = "target", **kwargs) -> float

Internal method that computes the metric for given forecast & actual data. This method should be implemented by all custom metrics.

Parameters:

- data_future (TimeSeriesDataFrame) - Actual values of the time series during the forecast horizon (prediction_length values for each time series in the dataset). Must have the same index as predictions.
- predictions (TimeSeriesDataFrame) - Data frame with predictions for the forecast horizon. Contains the column "mean" (point forecast) and the columns corresponding to each of the quantile levels. Must have the same index as data_future.
- target (str, default = "target") - Name of the column in data_future that contains the target time series.

Returns:

- score (float) - Value of the metric for the given forecast and data. If self.greater_is_better_internal is True, the score is returned in greater-is-better format, otherwise in lower-is-better format.
Custom mean squared error metric

Here is an example of how you can define a custom mean squared error (MSE) metric using TimeSeriesScorer.
import sklearn.metrics

from autogluon.timeseries.metrics import TimeSeriesScorer


class MeanSquaredError(TimeSeriesScorer):
    greater_is_better_internal = False
    optimum = 0.0

    def compute_metric(self, data_future, predictions, target, **kwargs):
        return sklearn.metrics.mean_squared_error(y_true=data_future[target], y_pred=predictions["mean"])
The internal method compute_metric returns the metric in lower-is-better format, so we need to set greater_is_better_internal=False. This will tell AutoGluon that the metric value must be multiplied by -1 to convert it to greater-is-better format.
Note

Custom metrics must be defined in a separate Python file and imported so that they can be pickled (Python’s serialization protocol). If a custom metric is not picklable, AutoGluon may crash during fit if you enable hyperparameter tuning. In the above example, you would want to create a new Python file such as my_metrics.py with class MeanSquaredError defined in it, and then use it via from my_metrics import MeanSquaredError.
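For example, the project layout could look like this (the training script name is illustrative):

# my_metrics.py
import sklearn.metrics

from autogluon.timeseries.metrics import TimeSeriesScorer


class MeanSquaredError(TimeSeriesScorer):
    greater_is_better_internal = False
    optimum = 0.0

    def compute_metric(self, data_future, predictions, target, **kwargs):
        return sklearn.metrics.mean_squared_error(y_true=data_future[target], y_pred=predictions["mean"])

# train.py
from my_metrics import MeanSquaredError

from autogluon.timeseries import TimeSeriesPredictor

predictor = TimeSeriesPredictor(eval_metric=MeanSquaredError())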
We can use the custom metric to measure the accuracy of a forecast generated by the predictor.
import pandas as pd

from autogluon.timeseries import TimeSeriesPredictor, TimeSeriesDataFrame

# Create dummy dataset
data = TimeSeriesDataFrame.from_iterable_dataset(
    [
        {"start": pd.Period("2023-01-01", freq="D"), "target": list(range(15))},
        {"start": pd.Period("2023-01-01", freq="D"), "target": list(range(30, 45))},
    ]
)

prediction_length = 3
train_data, test_data = data.train_test_split(prediction_length=prediction_length)
predictor = TimeSeriesPredictor(prediction_length=prediction_length, verbosity=0).fit(train_data, hyperparameters={"Naive": {}})
predictions = predictor.predict(train_data)
mse = MeanSquaredError()
mse_score = mse(
    data=test_data,
    predictions=predictions,
    prediction_length=predictor.prediction_length,
    target=predictor.target,
)
print(f"{mse.name_with_sign} = {mse_score}")
-MeanSquaredError = -4.666666666666667
Note that the metric value has been multiplied by -1 because we set greater_is_better_internal=False when defining the metric.
When we call the metric, TimeSeriesScorer takes care of splitting test_data into past & future parts, validating that predictions have correct timestamps, and ensuring that the score is reported in greater-is-better format.

During the metric call, the method compute_metric that we implemented receives as input the following arguments:
Test data corresponding to the forecast horizon:

data_future = test_data.slice_by_timestep(-prediction_length, None)
data_future

| item_id | timestamp | target |
|---|---|---|
| 0 | 2023-01-13 | 12 |
| 0 | 2023-01-14 | 13 |
| 0 | 2023-01-15 | 14 |
| 1 | 2023-01-13 | 42 |
| 1 | 2023-01-14 | 43 |
| 1 | 2023-01-15 | 44 |
Predictions for the forecast horizon:

predictions.round(2)

| item_id | timestamp | mean | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-01-13 | 11.0 | 9.72 | 10.16 | 10.48 | 10.75 | 11.0 | 11.25 | 11.52 | 11.84 | 12.28 |
| 0 | 2023-01-14 | 11.0 | 9.19 | 9.81 | 10.26 | 10.64 | 11.0 | 11.36 | 11.74 | 12.19 | 12.81 |
| 0 | 2023-01-15 | 11.0 | 8.78 | 9.54 | 10.09 | 10.56 | 11.0 | 11.44 | 11.91 | 12.46 | 13.22 |
| 1 | 2023-01-13 | 41.0 | 39.72 | 40.16 | 40.48 | 40.75 | 41.0 | 41.25 | 41.52 | 41.84 | 42.28 |
| 1 | 2023-01-14 | 41.0 | 39.19 | 39.81 | 40.26 | 40.64 | 41.0 | 41.36 | 41.74 | 42.19 | 42.81 |
| 1 | 2023-01-15 | 41.0 | 38.78 | 39.54 | 40.09 | 40.56 | 41.0 | 41.44 | 41.91 | 42.46 | 43.22 |
Note that both data_future and predictions cover the same time range.
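We can verify the reported score by hand. The Naive model repeats the last observed training value (11 and 41), so the errors over the 3-step horizon are 1, 2, and 3 for each of the two series:

# Hand computation of the MSE reported above (up to the sign flip)
errors = [1, 2, 3, 1, 2, 3]
print(sum(e**2 for e in errors) / len(errors))  # 4.666... matches the score above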
Custom quantile loss metric

The metric can be computed on any columns of the predictions DataFrame. For example, here is how we can define a mean quantile loss metric that measures the accuracy of the quantile forecast.
class MeanQuantileLoss(TimeSeriesScorer):
    needs_quantile = True
    greater_is_better_internal = False
    optimum = 0.0

    def compute_metric(self, data_future, predictions, target, **kwargs):
        quantile_columns = [col for col in predictions if col != "mean"]
        total_quantile_loss = 0.0
        for q in quantile_columns:
            total_quantile_loss += sklearn.metrics.mean_pinball_loss(y_true=data_future[target], y_pred=predictions[q], alpha=float(q))
        return total_quantile_loss / len(quantile_columns)
Here we set needs_quantile=True to tell AutoGluon that this metric is evaluated on the quantile forecasts. In this case, models such as DirectTabularModel will train a TabularPredictor with problem_type="quantile" under the hood. If needs_quantile=False, these models would use problem_type="regression" instead.
Custom mean absolute scaled error metric

Finally, here is how we can define the mean absolute scaled error (MASE) metric. Unlike the previously discussed metrics, MASE is computed using both past and future time series values. The past values are used to compute the scale by which we normalize the error during the forecast horizon.
class MeanAbsoluteScaledError(TimeSeriesScorer):
    greater_is_better_internal = False
    optimum = 0.0
    optimized_by_median = True
    equivalent_tabular_regression_metric = "mean_absolute_error"

    def save_past_metrics(
        self, data_past: TimeSeriesDataFrame, target: str = "target", seasonal_period: int = 1, **kwargs
    ) -> None:
        seasonal_diffs = data_past[target].groupby(level="item_id").diff(seasonal_period).abs()
        self._abs_seasonal_error_per_item = seasonal_diffs.groupby(level="item_id").mean().fillna(1.0)

    def clear_past_metrics(self):
        self._abs_seasonal_error_per_item = None

    def compute_metric(
        self, data_future: TimeSeriesDataFrame, predictions: TimeSeriesDataFrame, target: str = "target", **kwargs
    ) -> float:
        mae_per_item = (data_future[target] - predictions["mean"]).abs().groupby(level="item_id").mean()
        return (mae_per_item / self._abs_seasonal_error_per_item).mean()
We compute the metrics on past data using the save_past_metrics method. Doing this in a separate method allows AutoGluon to avoid redundant computations when fitting the weighted ensemble, which requires thousands of metric evaluations.

Because we set optimized_by_median=True, AutoGluon will automatically paste the median forecast into the "mean" column of predictions. This is done for consistency: if TimeSeriesPredictor is trained with a point forecast metric, the optimal point forecast will always be stored in the "mean" column.

Finally, the equivalent_tabular_regression_metric is used by forecasting models that fit TabularPredictor under the hood.
Using custom metrics in TimeSeriesPredictor
Now that we have created several custom metrics, let’s use them for training and evaluating models.
predictor = TimeSeriesPredictor(eval_metric=MeanQuantileLoss()).fit(train_data, hyperparameters={"Naive": {}, "SeasonalNaive": {}, "Theta": {}})
Beginning AutoGluon training...
AutoGluon will save models to 'AutogluonModels/ag-20241030_200149'
=================== System Info ===================
AutoGluon Version: 1.1.1b20241030
Python Version: 3.10.13
Operating System: Linux
Platform Machine: x86_64
Platform Version: #1 SMP Tue Sep 24 10:00:37 UTC 2024
CPU Count: 8
GPU Count: 1
Memory Avail: 28.23 GB / 30.95 GB (91.2%)
Disk Space Avail: 214.96 GB / 255.99 GB (84.0%)
===================================================
Fitting with arguments:
{'enable_ensemble': True,
'eval_metric': MeanQuantileLoss,
'hyperparameters': {'Naive': {}, 'SeasonalNaive': {}, 'Theta': {}},
'known_covariates_names': [],
'num_val_windows': 1,
'prediction_length': 1,
'quantile_levels': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
'random_seed': 123,
'refit_every_n_windows': 1,
'refit_full': False,
'skip_model_selection': False,
'target': 'target',
'verbosity': 2}
Inferred time series frequency: 'D'
Provided train_data has 24 rows, 2 time series. Median time series length is 12 (min=12, max=12).
Provided data contains following columns:
target: 'target'
AutoGluon will gauge predictive performance using evaluation metric: 'MeanQuantileLoss'
This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
===================================================
Starting training. Start time is 2024-10-30 20:01:49
Models that will be trained: ['Naive', 'SeasonalNaive', 'Theta']
Training timeseries model Naive.
-0.3323 = Validation score (-MeanQuantileLoss)
0.02 s = Training runtime
0.02 s = Validation (prediction) runtime
Training timeseries model SeasonalNaive.
-2.3263 = Validation score (-MeanQuantileLoss)
0.02 s = Training runtime
0.02 s = Validation (prediction) runtime
Training timeseries model Theta.
-0.2525 = Validation score (-MeanQuantileLoss)
0.02 s = Training runtime
24.64 s = Validation (prediction) runtime
Fitting simple weighted ensemble.
Ensemble weights: {'Theta': 1.0}
-0.2525 = Validation score (-MeanQuantileLoss)
2.35 s = Training runtime
24.64 s = Validation (prediction) runtime
Training complete. Models trained: ['Naive', 'SeasonalNaive', 'Theta', 'WeightedEnsemble']
Total runtime: 27.12 s
Best model: Theta
Best model score: -0.2525
We can also evaluate a trained predictor using these custom metrics:
predictor.evaluate(test_data, metrics=[MeanAbsoluteScaledError(), MeanQuantileLoss(), MeanSquaredError()])
Model not specified in predict, will default to the model with the best validation score: Theta
{'MeanAbsoluteScaledError': -0.07215009416852679,
'MeanQuantileLoss': -0.25252532958984375,
'MeanSquaredError': -0.25507616833783686}
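To compare all trained models on the test data under the predictor’s eval_metric, you can also use the leaderboard method (output omitted here):

predictor.leaderboard(test_data)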
That’s all it takes to create and use custom forecasting metrics in AutoGluon!
You can have a look at the AutoGluon source code for example implementations of point and quantile forecasting metrics.
If you create a custom metric, consider submitting a PR so that we can officially add it to AutoGluon.
For more tutorials, refer to Forecasting Time Series - Quick Start and Forecasting Time Series - In Depth.