
Metrics for Distributional Forecasts

Leonidas Tsaprounis
Trusted Data Science @ Haleon
13 min read · Feb 27, 2023


At Haleon, distributional forecasts play a key role in our inventory optimisation solutions. In the past we relied on a range of quantile scores and point forecast metrics to evaluate these forecasts. More recently we have been experimenting with metrics that take the entire predicted distribution into account. In this article we will go through two of these metrics, the Log Score and the CRPS, and examine their properties.

Introduction: Point Forecasts and Distributional Forecasts

Why do we need forecasting? I’d say a good answer is: to support and automate decision making under uncertainty. Uncertainty is the key word here. A point forecast usually shows the expected value (estimated by the mean) of the future values of our time series. For any continuous distribution, the probability of any single point is exactly 0, even for the expected value. So what we’re really saying by showing a point forecast is “the values are going to be somewhere around here”.

But what does “around” mean? What if the range of this “around” is huge? For example, at the outbreak of the COVID-19 pandemic, when ICU occupancy predictions were published, what should your local hospital have done? Plan for the mean occupancy? You’d probably want to be covered for a realistic worst case scenario (e.g. the 95% quantile of the forecast distribution). Or, in the case of inventory control, you might want to determine the stock required to maintain a particular service level. This is where distributional (also called probabilistic) forecasts enter the picture.

Example of a distributional forecast (grey areas) and a point forecast (orange line)

Distributional forecasts attempt to predict the distribution of the forecasted values, as opposed to a single point, and are common practice in areas like weather and infectious disease forecasting. Many popular statistical forecasting algorithms have analytical formulas for the forecast distribution under a few assumptions (e.g. that the errors are independent and identically distributed and come from a Normal distribution). There are also non-parametric ways to get an empirical distribution for your forecast, e.g. by bootstrapping the residuals [1].

To evaluate the performance of a forecasting algorithm we employ one of many time series cross-validation strategies and calculate performance using the predictions and the actuals for each fold.
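For illustration, here is a minimal sketch of an expanding-window splitter with non-overlapping test windows, in the spirit of the strategy shown in the figure below; the fold sizes and horizon are arbitrary choices for the example.

```python
import numpy as np

def expanding_window_splits(n_obs, initial_train_size, horizon):
    """Yield (train_idx, test_idx) pairs for expanding-window CV with non-overlapping test folds."""
    fold_start = initial_train_size
    while fold_start + horizon <= n_obs:
        yield np.arange(0, fold_start), np.arange(fold_start, fold_start + horizon)
        fold_start += horizon  # step forward by a full horizon so test windows never overlap

# Example: 48 monthly observations, the first 24 used for the initial fit, 6-step-ahead folds.
for train_idx, test_idx in expanding_window_splits(48, 24, 6):
    print(f"train: 0-{train_idx[-1]}, test: {test_idx[0]}-{test_idx[-1]}")
```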

Time series cross validation, expanding window with no overlap strategy.

The metrics for point forecasts are well known, and the pros and cons of each are still hotly debated among academics and practitioners.

But what is a good metric for distributional forecasts? We will never observe the entire distribution of a forecasted time-step. If we did have the distribution generating the observed values, we could just use something like Kullback-Leibler divergence to measure the distance between the 2 distributions, predicted and actual. Unfortunately, the reality is that we will only observe the actual point values. So what is a good metric that allows us to compare our predicted distribution to a point value?

Proper Scoring Rules

Scoring rules are metrics for assessing the quality of a distributional forecast against the actual values of the time series. “A scoring rule is proper if the forecaster maximises the expected score for an observation drawn from the distribution F if they issue the probabilistic forecast F, rather than G≠F. It is strictly proper if the maximum is unique.” [2]. In other words, if the scoring rule is strictly proper, the only way for a forecaster to achieve the best possible expected score is to provide the true data generating distribution as their forecast. A strictly proper scoring rule therefore encourages honest distributional forecasts. Don’t worry if this definition seems a bit abstract at the moment; we will show some examples of what it means in practical terms later in this article. We will cover 2 strictly proper scoring rules: the Log Score and the CRPS (Continuous Ranked Probability Score).

In this article we will use the convention that smaller values of a score are better, so contrary to the definition above, our goal will be to minimise the score instead of maximising it. We do this for consistency among different scoring rules and cost functions.

For the rest of this article we will use the same negative binomial distribution as [3] to show how we can reproduce the results from the paper using computational methods in Python. For sampling from the distributions we will use numpy, and for an actual distribution object we will use scipy frozen distributions.

The code below generates the distribution objects, the samples and the actuals for the examples shown in this article.
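The original embedded code is not reproduced here, so below is a minimal sketch of this setup. The negative binomial parameters (n = 10, p = 0.2, giving a mean of 40) and the grid of actuals are illustrative assumptions rather than the exact values used in [3].

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Illustrative negative binomial parameters (assumed; not necessarily those used in [3]).
# In scipy's parametrisation the mean is n * (1 - p) / p = 40.
n, p = 10, 0.2

# Frozen scipy distribution: the parameters are fixed, exposing pmf, cdf, ppf, etc.
frozen_dist = stats.nbinom(n, p)

# Sampled forecast paths, the form in which many libraries serve distributional forecasts.
samples = rng.negative_binomial(n, p, size=10_000)

# A grid of hypothetical observed values ("actuals") to score the forecast against.
actuals = np.arange(0, 121)
```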

Log Score

The Log Score (LogS), introduced by I. J. Good in 1952 [4], is the log of the probability density function (PDF) of the predicted distribution evaluated at the actual value, i.e. LogS(f, y) = log(f(y)), where f is the PDF of the predicted distribution and y is the observed value. Because we use the convention that a smaller value is better for all scores, in this article we will use -LogS.

The Log Score was the metric used in the CDC’s FluSight competition, which has gained wide recognition in the epidemiological forecasting community.

Using a scipy frozen distribution, it’s very easy to calculate the log score in Python: log_score = -np.log(frozen_dist.pdf(y)) (or frozen_dist.pmf(y) for a discrete distribution such as the negative binomial).
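Reusing the hypothetical frozen_dist and actuals from the setup sketch above, the whole curve in the next figure can be computed with one vectorised call (using pmf because the negative binomial is discrete):

```python
# Negative log score (smaller is better) for every hypothetical actual on the grid.
log_scores = -np.log(frozen_dist.pmf(actuals))

# Score for a single observation, e.g. y = 30.
y = 30
log_score = -np.log(frozen_dist.pmf(y))
```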

The graph below shows the score (green line) for different values of the actuals (x-axis) for the same distribution (blue line):

The log score (green line) for the same predicted distribution (blue line) for different values of actuals.

A problem with the Log Score, also highlighted in [2], is that it doesn’t work well when sampled forecast paths are used in place of a probability density function. However, sampled paths are a very common way of serving distributional forecasts. Below is an attempt at making the Log Score work for samples by binning them into a normalised histogram.
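A rough sketch of that idea is shown below; the number of bins is a tuning choice, and the score becomes infinite whenever the observation lands in an empty bin, which is one reason this approach is fragile.

```python
def log_score_from_samples(samples, y, bins=50):
    """Approximate -log f(y) by binning the samples into a normalised histogram."""
    density, edges = np.histogram(samples, bins=bins, density=True)
    # Index of the bin containing the observation y (clipped to the valid range).
    bin_idx = np.clip(np.searchsorted(edges, y, side="right") - 1, 0, len(density) - 1)
    return -np.log(density[bin_idx])  # np.inf if the bin is empty

print(log_score_from_samples(samples, y=30))
```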

Log Score (green line) for the same samples (yellow histogram) for different values of actuals.

Now let’s revisit the definition of a proper scoring rule. A scoring rule is proper if the forecaster minimises (because we adopt the convention that smaller values are better) the expected score for an observation drawn from the distribution F if they issue the probabilistic forecast F, rather than G≠F. As discussed above, in practice we will never know the distribution F that generates the real-life data. However, it’s very easy to show this property through a simulation experiment, where we can define F beforehand.

The experiment is simple: we choose a distribution F and sample a large number of observations from it; these will be the simulated actuals. We then calculate the Log Score for distributions with a range of parameters around the distribution F we used to generate the data. If the score is indeed a proper scoring rule, its average across all simulated observations (i.e. its expectation) should be smaller for distributions that are close to F and minimised at F itself.

To generate the data we use a Normal distribution with mean μ = 0 and standard deviation σ = 3, and for the candidate predicted distributions we use Normal distributions with -1 ≤ μ ≤ 1 and 2 ≤ σ ≤ 4. The figure below shows the heatmap of the Log Score in this parameter space: red indicates a low and blue a high Log Score. From the heatmap it’s clear that the closer the parameters of the predicted distribution are to those of the data generating distribution, the smaller the score becomes, which is exactly what the proper scoring property predicts.
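A sketch of this simulation follows; the number of simulated observations and the grid resolution are arbitrary choices.

```python
rng = np.random.default_rng(0)

# Simulated actuals drawn from the known data generating distribution F = Normal(0, 3).
observations = rng.normal(loc=0.0, scale=3.0, size=10_000)

mus = np.linspace(-1, 1, 21)
sigmas = np.linspace(2, 4, 21)

# Average negative log score for every candidate Normal(mu, sigma) on the grid.
mean_scores = np.empty((len(sigmas), len(mus)))
for i, sigma in enumerate(sigmas):
    for j, mu in enumerate(mus):
        mean_scores[i, j] = -stats.norm(mu, sigma).logpdf(observations).mean()

# If the score is proper, the minimum should sit close to (mu, sigma) = (0, 3).
best_i, best_j = np.unravel_index(mean_scores.argmin(), mean_scores.shape)
print(mus[best_j], sigmas[best_i])
```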

Simulations experiment to illustrate the proper score property of the Log Score.

Continuous Ranked Probability Score (CRPS)

The first time I came across CRPS was in this talk by Prof. Rob Hyndman, where he went through the details of a COVID-19 forecasting model he developed for Australia. CRPS was introduced as a proper score for continuous distributions back in 1976 by James E. Matheson and Robert L. Winkler [5].

CRPS is based on the cumulative distribution function (CDF) as opposed to the probability density function (PDF) used in the Log Score. This makes it more suitable for cases where the forecast is in the form of samples, because we can use the empirical CDF. It is defined as CRPS(F, y) = ∫ (F(x) − 1{x ≥ y})² dx, where the integral runs over all real x, F is the CDF of the predicted distribution, y is the observed value and 1{x ≥ y} is the indicator (step) function that equals 1 when x ≥ y and 0 otherwise.

Conceptually, CRPS quantifies the difference between the CDF of the predicted distribution and the CDF of a “perfect” prediction. A perfect distributional forecast for a given observed value is essentially a point forecast equal to that value, and its CDF is a step function going from 0 to 1, with the step at the actual value. The grey areas in the graph below show the 2 areas that are summed to give the CRPS:

The areas that are summed to get the CRPS for a distributional forecast, similar to Figure 1 in [3].

CRPS has another nice property though: if the forecast is a point prediction, the CRPS equals the absolute error, and averaged over a test set it therefore equals the Mean Absolute Error (MAE). Some pre-high-school geometry is enough to understand why, looking at the graph below:

The areas that are summed to get the CRPS for a point forecast.

The area between the 2 CDFs is just the area of the rectangle they form: its width is the absolute difference between the prediction and the actual, and its height is 1. Hence the CRPS of a point prediction is simply the absolute error, and the MAE once averaged over a test set. More formally, with some high-school level calculus: for a point forecast ŷ the predicted CDF is the step function F(x) = 1{x ≥ ŷ}, so CRPS(F, y) = ∫ (1{x ≥ ŷ} − 1{x ≥ y})² dx = |ŷ − y|, since the integrand equals 1 exactly between min(ŷ, y) and max(ŷ, y) and 0 everywhere else.

Of course, for an empirical CDF we won’t be evaluating this integral with pen and paper. That’s not a problem, because we can use numerical methods such as numpy’s trapz to compute it in a vectorised way in Python. Using broadcasting and vectorisation, we can calculate the CRPS for multiple samples without for-loops, getting an easy ~10x speed-up in the calculations.
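Here is a minimal sketch of that calculation for sample-based forecasts; the integration grid is an arbitrary choice and the helper name is just for illustration.

```python
def crps_from_samples(samples, y, grid_size=2001):
    """Approximate CRPS(F, y) = integral of (F(x) - 1{x >= y})^2 dx using the empirical CDF."""
    samples = np.asarray(samples)
    lo = min(samples.min(), y) - 1.0
    hi = max(samples.max(), y) + 1.0
    grid = np.linspace(lo, hi, grid_size)
    # Empirical CDF of the samples on the grid, computed with broadcasting (no loops).
    ecdf = (samples[None, :] <= grid[:, None]).mean(axis=1)
    # CDF of a "perfect" forecast: a step from 0 to 1 at the observed value y.
    step = (grid >= y).astype(float)
    return np.trapz((ecdf - step) ** 2, grid)

print(crps_from_samples(samples, y=30))
```

The same broadcasting trick extends to scoring several forecast horizons at once by stacking the sample paths into a 2-D array.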

CRPS (green line) for the same predicted distribution (blue line) for different values of actuals.
CRPS (green line) for the same samples (yellow histogram) for different values of actuals.

To illustrate the proper scoring property of the CRPS we can perform the same experiment as with the Log Score above. The results are as expected: the average score is minimised for distributions whose parameters are close to those of the data generating distribution.

Simulations experiment to illustrate the proper score property of the CRPS.

CRPS depends on the scale of the time series, so if you want to meaningfully aggregate and compare the CRPS across multiple time series you will need to scale it somehow. A good strategy is to divide it by the CRPS of a benchmark forecast (e.g. a naive forecast) [1][6]. For a less computationally expensive option you can use the same scaling factor as in MASE (Mean Absolute Scaled Error) [8], namely the MAE of a one-step naive forecast on the training set, though this is not as robust as the former scaling method.
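As an illustration, a sketch of the MASE-style scaling could look like this (the function name is hypothetical):

```python
def scaled_crps(crps_values, y_train):
    """Scale CRPS values by the in-sample MAE of a one-step naive forecast, as in MASE."""
    naive_mae = np.mean(np.abs(np.diff(y_train)))  # mean |y_t - y_{t-1}| on the training set
    return np.asarray(crps_values) / naive_mae
```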

Local vs Distance-Sensitive Scores

The Log Score and CRPS belong to 2 different classes of scoring rules.

The Log Score is a local scoring rule, meaning that the value of the score for a particular observation depends only on the value of the probability density function at that observation [7]. This is clear from its definition, -LogS(f, y) = -log(f(y)), where f is the PDF of the predicted distribution and f(y) is its value at the observation.

CRPS is a distance-sensitive score. In cases where the values have a meaningful ranking (e.g. an inventory of 10 toothbrushes is smaller than one of 20) we would usually prefer a scoring rule that takes into account the “distance” of the predicted distribution from the observed value and not just the assigned probability. For more details on the definition of a distance-sensitive scoring rule, see section 3 in [6].

To put the difference between the 2 classes into perspective, consider the trivial case where we supply a point forecast instead of a distributional forecast. Let’s say the observed value is 101 and the point forecast is 100; in this case the predicted distribution puts all of its probability on 100, i.e. f(x) = 1 if x = 100 and 0 otherwise. The Log Score for this forecast will be an undefined -log(0), because the probability of any value other than 100 is 0, while the CRPS will be 1 (the absolute error), which can be considered a good score in this case. Depending on the use case we might want our score to behave one way or the other, but for inventory control a distance-sensitive score is more appropriate.

The differences between the 2 classes of scores are also very clear in the example below, where we calculate the scores for a symmetric bimodal distribution (double Weibull). In this case the PDF at the median of the distribution is very close to zero. As a result, the CRPS is minimised if the observation is equal to the median of the predicted distribution, while the Log Score is maximised there, meaning that the 2 scores are in strong disagreement close to the median.
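This disagreement can be reproduced with scipy’s double Weibull distribution; the shape parameter below is an arbitrary choice that gives a clearly bimodal density, and the CRPS is computed numerically on a grid as before.

```python
# Symmetric, bimodal predicted distribution: the density is close to 0 at the median (x = 0).
bimodal = stats.dweibull(2)

grid = np.linspace(-4, 4, 2001)
cdf_on_grid = bimodal.cdf(grid)

# Compare an observation near the median (0.05) with one near a mode (~0.7).
for y in [0.05, 0.7]:
    log_score = -np.log(bimodal.pdf(y))
    crps = np.trapz((cdf_on_grid - (grid >= y)) ** 2, grid)
    print(f"y={y}: Log Score={log_score:.2f}, CRPS={crps:.2f}")
```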

Log Score (green) and CRPS (red) for a bimodal double Weibull distribution (blue).

Putting it all together!

Now it’s time to put these metrics into action! Unfortunately, there is no library that outputs scipy frozen distributions as a distributional forecast (although admittedly that would be a good idea). However, a lot of libraries provide probabilistic forecasts in the form of sampled future paths.

For this article we’ll use statsmodels and its tried and tested time series API, which gives most forecasting models a simulate method that produces multiple simulated future paths. We will use 2 ETS [9] models fitted to the famous airline passengers dataset. Here is what these simulated future paths look like:

Simulated future paths from statsmodels.tsa.exponential_smoothing.ets.ETSResults.simulate

The best model for this time series (based on the Akaike Information Criterion, AIC) is a model with multiplicative error, additive trend and multiplicative seasonality terms, ETS-MAM. To give that model some competition it can easily beat, we’ll compare it with an ETS model with no seasonality, specifically an ETS-MMN. We’ll use the last 12 months of the dataset as test data to evaluate the out-of-sample performance of the 2 models. For our distributional forecasts we’ll use 10,000 simulated future paths from each model (i.e. 10K samples for each forecasted time step).
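A sketch of how this comparison could be set up with statsmodels is shown below, reusing the crps_from_samples helper from earlier. The air_passengers variable is assumed to be the monthly airline passengers series already loaded as a pandas Series with a DatetimeIndex; loading it is not shown here.

```python
import numpy as np
from statsmodels.tsa.exponential_smoothing.ets import ETSModel

# `air_passengers`: monthly airline passengers as a pandas Series (assumed to be loaded already).
train, test = air_passengers[:-12], air_passengers[-12:]

# ETS-MAM: multiplicative error, additive trend, multiplicative seasonality.
mam_fit = ETSModel(train, error="mul", trend="add", seasonal="mul", seasonal_periods=12).fit()
# ETS-MMN: multiplicative error, multiplicative trend, no seasonality.
mmn_fit = ETSModel(train, error="mul", trend="mul").fit()

horizon, n_paths = len(test), 10_000

# 10,000 simulated future paths per model: a (horizon x n_paths) array of samples.
mam_paths = np.asarray(mam_fit.simulate(nsimulations=horizon, anchor="end", repetitions=n_paths))
mmn_paths = np.asarray(mmn_fit.simulate(nsimulations=horizon, anchor="end", repetitions=n_paths))

# Mean CRPS across the 12 test months, using the sample-based CRPS sketched earlier.
mam_crps = np.mean([crps_from_samples(mam_paths[h], test.iloc[h]) for h in range(horizon)])
mmn_crps = np.mean([crps_from_samples(mmn_paths[h], test.iloc[h]) for h in range(horizon)])
print(mam_crps, mmn_crps)
```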

ETS-MAM model’s probabilistic forecast in the form of 10,000 simulated future paths and the corresponding point forecast.
ETS-MMN model’s probabilistic forecast in the form of 10,000 simulated future paths and the corresponding point forecast.

And here are the results:

MAE, mean CRPS and mean Log Score for ETS-MAM and ETS-MMN point and probabilistic forecasts on the airline passengers dataset.

Note that we can measure the CRPS for the point forecasts too, and unsurprisingly it is approximately equal to the MAE. As expected, each model’s point forecast performs worse in terms of CRPS than its distributional forecast, but the accurate point forecast of the seasonal model (ETS-MAM) still gets a better CRPS than the distributional forecast of the non-seasonal model (ETS-MMN), because CRPS is a distance-sensitive score.

Some remarks

A popular phrase among forecasting practitioners is “horses for courses”, meaning that the performance ranking between different forecasting models depends on multiple factors, including the metric used. In our case, we want to find the right metric that will help us select the best distributional forecast model for our inventory control system. Kourentzes et al. [10] demonstrate that the link between forecast accuracy and inventory control performance is not as straightforward as we may think. In the paper they compare models with respect to point forecast metrics and inventory control performance in simulations. The results show some very interesting disagreements!

This is why we’re still experimenting with these metrics and their applications for our inventory control problems. It’s important for us to identify where the performance rankings disagree with other metrics, what their general properties are and which metric is the most suitable for our problem.

To learn more about how we use distributional forecasts for inventory control, check out this presentation from Dr. Gueorgui Mihaylov, Haleon’s Principal Data Scientist for Supply Chain and Finance, at the annual OR Society conference.

To reproduce the experiments and the graphs from this article check this notebook.

Related software

  1. properscoring — A Python library with efficient implementations of the CRPS and the Brier Score, leveraging numba for a significant speed-up. Unfortunately, the last release was in 2015.
  2. scoringutils — An R library with a wide range of scoring rules. I also noticed that one of the authors of [3], Prof. Nicholas Reich, is one of the contributors to this library!
  3. scoringRules — An R library with scoring rule implementations in Rcpp for better performance.

References

[1] Hyndman, R.J., & Athanasopoulos, G. (2021) Forecasting: principles and practice, 3rd edition, OTexts: Melbourne, Australia. OTexts.com/fpp3. Accessed on October 10th 2022.

[2] Tilmann Gneiting & Adrian E Raftery (2007) Strictly Proper Scoring Rules, Prediction, and Estimation, Journal of the American Statistical Association, 102:477, 359–378, DOI: 10.1198/016214506000001437

[3] Bracher J, Ray EL, Gneiting T, Reich NG (2021) Evaluating epidemic forecasts in an interval format. PLOS Computational Biology 17(2): e1008618. https://doi.org/10.1371/journal.pcbi.1008618

[4] Good, I.J. (1952), Rational Decisions. Journal of the Royal Statistical Society: Series B (Methodological), 14: 107–114. https://doi.org/10.1111/j.2517-6161.1952.tb00104.x

[5] Matheson, J. E., & Winkler, R. L. (1976). Scoring Rules for Continuous Probability Distributions. Management Science, 22(10), 1087–1096. http://www.jstor.org/stable/2629907

[6] Jose, V.R., Nau, R.F., & Winkler, R.L. (2009). Sensitivity to Distance and Baseline Distributions in Forecast Evaluation. Manag. Sci., 55, 582–590.

[7] Bernardo, J. M. (1979). Expected Information as Expected Utility. The Annals of Statistics, 7(3), 686–690. http://www.jstor.org/stable/2958753

[8] Hyndman, R. J., Koehler, A. B. (2006) Another look at measures of forecast accuracy. International Journal of Forecasting, Volume 22, Issue 4, Pages 679–688, ISSN 0169–2070, https://doi.org/10.1016/j.ijforecast.2006.03.001.

[9] Hyndman, R. J., Koehler, A. B., Ord, J. K., & Snyder, R. D. (2008). Forecasting with exponential smoothing: The state space approach. Springer-Verlag. http://www.exponentialsmoothing.net

[10] Kourentzes, N., Svetunkov, I., & Trapero, J. R. (2021). Connecting forecasting and inventory performance: a complex task. Available at SSRN: https://ssrn.com/abstract=3878176 or http://dx.doi.org/10.2139/ssrn.3878176

Leonidas Tsaprounis
Trusted Data Science @ Haleon

Senior Data Scientist at Haleon and occasional open source contributor specialising in time series forecasting