Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Valid Error Bars for Neural Weather Models using Conformal Prediction

Vignesh Gopakumar    Joel Oskarrson    Ander Gray    Lorenzo Zanisi    Stanislas Pamela    Daniel Giles    Matt Kusner    Marc Deisenroth
Abstract

Neural weather models have shown immense potential as inexpensive and accurate alternatives to physics-based models. However, most models trained to perform weather forecasting do not quantify the uncertainty associated with their forecasts. This limits the trust in the model and the usefulness of the forecasts. In this work we construct and formalise a conformal prediction framework as a post-processing method for estimating this uncertainty. The method is model-agnostic and gives calibrated error bounds for all variables, lead times and spatial locations. No modifications are required to the model and the computational cost is negligible compared to model training. We demonstrate the usefulness of the conformal prediction framework on a limited area neural weather model for the Nordic region. We further explore the advantages of the framework for deterministic and probabilistic models.

Surrogate Models, Weather, Climate, Uncertainty Quantification, Conformal Prediction

1 Introduction

Weather and climate models are essential tools for forecasting future changes in temperature, precipitation patterns, and extreme weather events. However, these models are characterised with inherent uncertainty arising from the complexity of the Earth’s climate system and the challenges in representing all relevant processes and feedbacks (Leutbecher & Palmer, 2008). By using neural networks for weather and climate modelling, researchers can leverage the power of machine learning to capture complex patterns and relationships in climate data, potentially improving the accuracy of predictions.

Our work relies on quantifying the uncertainty associated with these neural-network-driven weather models by way of Conformal Prediction (CP). Uncertainty Quantification (UQ) is crucial for ensuring that forecasts are trustworthy and actionable. By providing robust uncertainty estimates for neural weather models, we can help decision-makers understand the range of possible outcomes with varying levels of confidence in different scenarios. This information is essential for developing effective weather management policies and mitigating adverse weather events (Bodnar et al., 2024).

2 Related Work

Traditionally, within numerical weather forecasting, uncertainty estimates are obtained via ensemble methods (Leutbecher & Palmer, 2008). These methods generate ensemble forecasts as samples from the distribution of possible future atmospheric states. Ensemble forecasting is achieved in physical models by introducing perturbations to the initial conditions (Buizza et al., 2008) and to the model parameterisations (Palmer et al., 2009). Also neural weather models largely rely on ensemble methods to provide UQ. These ensemble are created based on perturbing initial states with noise (Pathak et al., 2022; Bi et al., 2023; Scher & Messori, 2021), using sets of different neural network parameters (Graubner et al., 2022) or by training generative models (Price et al., 2023). Common to all ensemble methods is that they require rolling out multiple forecasts in order to produce uncertainty estimate. As an alternative, post-hoc approaches perform UQ across the model outputs as a post-processing step (Bülte et al., 2024; Höhlein et al., 2024). Conformal Prediction as a method of adding post-hoc uncertainty estimates has gained popularity in the recent past, finding a range of applications within spatio-temporal modelling (Sun, 2022; Ma et al., 2024; Xu et al., 2023).

3 Conformal Prediction

CP (Shafer & Vovk, 2008) gives an answer to the following question: Given some arbitrary dataset (X1,Y1),(X2,Y2),,(Xn,Yn)subscript𝑋1subscript𝑌1subscript𝑋2subscript𝑌2subscript𝑋𝑛subscript𝑌𝑛(X_{1},Y_{1}),(X_{2},Y_{2}),...,(X_{n},Y_{n})( italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) , ( italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) , … , ( italic_X start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ), and some machine learning model f^^𝑓\hat{f}over^ start_ARG italic_f end_ARG trained on this dataset, what is the accuracy of f^^𝑓\hat{f}over^ start_ARG italic_f end_ARG at predicting the next true label Yn+1subscript𝑌𝑛1Y_{n+1}italic_Y start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT at query point Xn+1subscript𝑋𝑛1X_{n+1}italic_X start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT. Conformal prediction extends the point prediction y^^𝑦\hat{y}over^ start_ARG italic_y end_ARG of f^^𝑓\hat{f}over^ start_ARG italic_f end_ARG to a prediction set αsuperscript𝛼\mathbb{C}^{\alpha}blackboard_C start_POSTSUPERSCRIPT italic_α end_POSTSUPERSCRIPT, which is guaranteed to contain the true label Yn+1subscript𝑌𝑛1Y_{n+1}italic_Y start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT with probability 1α1𝛼1-\alpha1 - italic_α

(Yn+1α)1α.subscript𝑌𝑛1superscript𝛼1𝛼\mathbb{P}(Y_{n+1}\in\mathbb{C}^{\alpha})\geq 1-\alpha.blackboard_P ( italic_Y start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT ∈ blackboard_C start_POSTSUPERSCRIPT italic_α end_POSTSUPERSCRIPT ) ≥ 1 - italic_α . (1)

CP is attractive for quantifying errors in machine learning since the inequality in Equation 1 is guaranteed regardless of the selected machine learning model and the training dataset {(Xi,Yi)}i=1nsuperscriptsubscriptsubscript𝑋𝑖subscript𝑌𝑖𝑖1𝑛\left\{(X_{i},Y_{i})\right\}_{i=1}^{n}{ ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT, other than that the samples utilised for the calibration (estimating model performance) being exchangeable (a weaker form of i.i.d.).

There have been several variants of CP since its original proposal by Vovk et al. (2005). Inductive conformal prediction (Papadopoulos, 2008), which we pursue in this work, splits the training set into a proper training set (the usual training set for the underlying ML model) and a calibration set, used to construct the prediction sets \mathbb{C}blackboard_C. The inductive CP framework further extended in this paper, follows a three-step procedure as outlined in Figure 1. It starts with a calibration, followed by estimation of the quantile from the Cumulative Distribution Function (CDF) of non-conformity scores and use the estimated quantile to obtain the prediction sets.

s^=|ycy~c|^𝑠subscript𝑦𝑐subscript~𝑦𝑐\hat{s}=|y_{c}-\tilde{y}_{c}|over^ start_ARG italic_s end_ARG = | italic_y start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT - over~ start_ARG italic_y end_ARG start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT |Calibrationq^=Fs^1((n+1)(1α)n)^𝑞subscriptsuperscript𝐹1^𝑠𝑛11𝛼𝑛\hat{q}=F^{-1}_{\hat{s}}\bigg{(}\frac{\lceil(n+1)(1-\alpha)\rceil}{n}\bigg{)}over^ start_ARG italic_q end_ARG = italic_F start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT over^ start_ARG italic_s end_ARG end_POSTSUBSCRIPT ( divide start_ARG ⌈ ( italic_n + 1 ) ( 1 - italic_α ) ⌉ end_ARG start_ARG italic_n end_ARG )Quantile Estimationyp^=y~p±q^^subscript𝑦𝑝plus-or-minussubscript~𝑦𝑝^𝑞\hat{y_{p}}=\tilde{y}_{p}\pm\hat{q}over^ start_ARG italic_y start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT end_ARG = over~ start_ARG italic_y end_ARG start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ± over^ start_ARG italic_q end_ARGPrediction Sets
Figure 1: Inductive CP Framework over a Deterministic Model (see RES in Section 3.1.2): (1) Perform calibration using a non-conformity metric (L1 error residual with s^^𝑠\hat{s}over^ start_ARG italic_s end_ARG representing the calibration scores, yc,y~csubscript𝑦𝑐subscript~𝑦𝑐y_{c},\tilde{y}_{c}italic_y start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT , over~ start_ARG italic_y end_ARG start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT the calibration targets and predictions respectively). (2) Estimate the quantile corresponding to the desired coverage from the CDF of the non-conformity scores (n𝑛nitalic_n represents the calibration sample size, (1α)1𝛼(1-\alpha)( 1 - italic_α ) the desired coverage, Fs^1superscriptsubscript𝐹^𝑠1F_{\hat{s}}^{-1}italic_F start_POSTSUBSCRIPT over^ start_ARG italic_s end_ARG end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT the quantile function applied over the inverse CDF of non-conformity scores, q^^𝑞\hat{q}over^ start_ARG italic_q end_ARG the quantile matching the desired coverage). (3) Apply the quantile to the model predictions to estimate the prediction sets (y~psubscript~𝑦𝑝\tilde{y}_{p}over~ start_ARG italic_y end_ARG start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT, the model predictions and q^^𝑞\hat{q}over^ start_ARG italic_q end_ARG the upper and lower bars for the predictions).

3.1 Conformal Prediction over a Spatio-Temporal Domain

Neural Weather Models are trained to map the evolution of spatio-temporal field variables. Starting with an initial distribution of the field variables across the spatial domain, the temporal evolution is mapped by the model in an autoregressive manner. Each model output, spanning the spatio-temporal domain of interest, is characterised as one data point within our framework. Exchangeability is maintained as each calibration sample is taken across the entire spatio-temporal domain of the output. Within our framework, we perform calibration for each cell individually, while preserving the tensorial structure. This allows us to estimate the marginal coverage for each cell, characterising a spatio-temporal point with a specific variable of interest. CP is performed across each spatio-temporal point output by the model, resulting in upper and lower coverage bands for each point. Upon calibrating for the error bars cell-wise, for validation, we average over all to estimate the coverage across the simulation domain. Within the estimation of the prediction sets, we don’t consider the influence of adjacent field points and implicitly expect the model to extract that within the learning process. We find that this approach maintains exchangeability across the various forecasts, allowing us to do CP across spatio-temporal data efficiently.

3.1.1 Formal Definition

We define a model 𝒴=f^(𝒳)𝒴^𝑓𝒳\mathcal{Y}=\hat{f}(\mathcal{X})caligraphic_Y = over^ start_ARG italic_f end_ARG ( caligraphic_X ), that learns to map the evolution of an initial temporal sequence of spatial fields (𝒳Tin×Nx×Ny×Nvar𝒳superscriptsubscript𝑇insubscript𝑁𝑥subscript𝑁𝑦subscript𝑁𝑣𝑎𝑟\mathcal{X}\in\mathbb{R}^{T_{\text{in}}\times N_{x}\times N_{y}\times N_{var}}caligraphic_X ∈ blackboard_R start_POSTSUPERSCRIPT italic_T start_POSTSUBSCRIPT in end_POSTSUBSCRIPT × italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_N start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT × italic_N start_POSTSUBSCRIPT italic_v italic_a italic_r end_POSTSUBSCRIPT end_POSTSUPERSCRIPT) to a later temporal sequence of spatial fields (𝒴Tout×Nx×Ny×Nvar𝒴superscriptsubscript𝑇outsubscript𝑁𝑥subscript𝑁𝑦subscript𝑁var\mathcal{Y}\in\mathbb{R}^{T_{\text{out}}\times N_{x}\times N_{y}\times N_{% \text{var}}}caligraphic_Y ∈ blackboard_R start_POSTSUPERSCRIPT italic_T start_POSTSUBSCRIPT out end_POSTSUBSCRIPT × italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_N start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT × italic_N start_POSTSUBSCRIPT var end_POSTSUBSCRIPT end_POSTSUPERSCRIPT). The model inputs and outputs are characterised by 4D tensors, where Tin,Toutsubscript𝑇insubscript𝑇outT_{\text{in}},T_{\text{out}}italic_T start_POSTSUBSCRIPT in end_POSTSUBSCRIPT , italic_T start_POSTSUBSCRIPT out end_POSTSUBSCRIPT represents the temporal dimension of the initial states and forecast respectively, Nxsubscript𝑁𝑥N_{x}italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT represents the x𝑥xitalic_x-dimension, Nysubscript𝑁𝑦N_{y}italic_N start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT represents the y𝑦yitalic_y-dimension and Nvarsubscript𝑁varN_{\text{var}}italic_N start_POSTSUBSCRIPT var end_POSTSUBSCRIPT the field dimension (number of modelled weather variables). The calibration procedure is defined as q^=C^(𝒴,Y)^𝑞^𝐶𝒴𝑌\hat{q}=\hat{C}(\mathcal{Y},Y)over^ start_ARG italic_q end_ARG = over^ start_ARG italic_C end_ARG ( caligraphic_Y , italic_Y ), utilising the model prediction (𝒴𝒴\mathcal{Y}caligraphic_Y) and ground truth (Y𝑌Yitalic_Y) to estimate the quantile (q^^𝑞\hat{q}over^ start_ARG italic_q end_ARG) associated with the desired coverage. The operation is executed in a point-wise manner as each 𝒴,Y,q^Tout×Nx×Ny×Nvar𝒴𝑌^𝑞superscriptsubscript𝑇outsubscript𝑁𝑥subscript𝑁𝑦subscript𝑁var\mathcal{Y},Y,\hat{q}\in\mathbb{R}^{T_{\text{out}}\times N_{x}\times N_{y}% \times N_{\text{var}}}caligraphic_Y , italic_Y , over^ start_ARG italic_q end_ARG ∈ blackboard_R start_POSTSUPERSCRIPT italic_T start_POSTSUBSCRIPT out end_POSTSUBSCRIPT × italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_N start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT × italic_N start_POSTSUBSCRIPT var end_POSTSUBSCRIPT end_POSTSUPERSCRIPT. The quantile is further utilised (as given in Section 3.1.2) to obtain the lower (L𝐿Litalic_L) and upper error bars (U𝑈Uitalic_U) across all the cells form the prediction set \mathbb{C}blackboard_C, where L𝐿Litalic_L and U𝑈Uitalic_U have the same dimensionality as q^^𝑞\hat{q}over^ start_ARG italic_q end_ARG. At inference for a prediction point Xn+1subscript𝑋𝑛1X_{n+1}italic_X start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT with true label Yn+1subscript𝑌𝑛1Y_{n+1}italic_Y start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT, the expectation of coverage i.e. the prediction set, is guaranteed to satisfy

𝔼[(Yn+1L)(Yn+1U)]1α.𝔼delimited-[]subscript𝑌𝑛1𝐿subscript𝑌𝑛1𝑈1𝛼\mathbb{E}\bigg{[}(Y_{n+1}\geq L)\wedge(Y_{n+1}\leq U)\bigg{]}\geq 1-\alpha.blackboard_E [ ( italic_Y start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT ≥ italic_L ) ∧ ( italic_Y start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT ≤ italic_U ) ] ≥ 1 - italic_α . (2)

Equation 2 is statistically guaranteed to hold for each cell of the spatio-temporal tensor, provided we provide sufficient samples and exchangeability is maintained (Vovk, 2012).

3.1.2 Non-conformity Scores

A non-conformity score can be described as a measure of the model’s performance across the calibration dataset. Thus, the scores are expressed as a function of the trained model, the calibration inputs and outputs (Angelopoulos & Bates, 2021). Being expressed as the deviation of the model from the ground truth, we focus on two methods of estimating the non-conformity scores:

  • Absolute Error Residual (RES): Utilised for deterministic models, this method involves using the absolute error of the model across the labelled calibration dataset (Lei et al., 2018). The RES non-conformity score is s(x,y)=|yf~(x)|𝑠𝑥𝑦𝑦~𝑓𝑥s(x,y)=|y-\tilde{f}(x)|italic_s ( italic_x , italic_y ) = | italic_y - over~ start_ARG italic_f end_ARG ( italic_x ) |. Upon deriving the quantile (q^^𝑞\hat{q}over^ start_ARG italic_q end_ARG), the prediction set is obtained as : {f~(x)q^,f~(x)+q^}~𝑓𝑥^𝑞~𝑓𝑥^𝑞\{\tilde{f}(x)-\hat{q},\;\tilde{f}(x)+\hat{q}\}{ over~ start_ARG italic_f end_ARG ( italic_x ) - over^ start_ARG italic_q end_ARG , over~ start_ARG italic_f end_ARG ( italic_x ) + over^ start_ARG italic_q end_ARG }.

  • Standard Deviation (STD): Utilised for probabilistic models, that output a Gaussian predictive distribution with a mean (μ(x))𝜇𝑥(\mathbb{\mu}(x))( italic_μ ( italic_x ) ) and standard deviation (σ(x))𝜎𝑥(\sigma(x))( italic_σ ( italic_x ) ). The non-conformity score, s(x,y)=yμ(x)σ(x)𝑠𝑥𝑦𝑦𝜇𝑥𝜎𝑥s(x,y)=\frac{y-\mathbb{\mu}(x)}{\sigma(x)}italic_s ( italic_x , italic_y ) = divide start_ARG italic_y - italic_μ ( italic_x ) end_ARG start_ARG italic_σ ( italic_x ) end_ARG, takes in the uncertainty of the model and conformalises it to provide validity over the predictive uncertainty. Upon deriving the quantile (q^^𝑞\hat{q}over^ start_ARG italic_q end_ARG), the prediction set is obtained as : {μ(x)q^σ(x),μ(x)+q^σ(x)}𝜇𝑥^𝑞𝜎𝑥𝜇𝑥^𝑞𝜎𝑥\{\mu(x)-\hat{q}\sigma(x),\;\mu(x)+\hat{q}\sigma(x)\}{ italic_μ ( italic_x ) - over^ start_ARG italic_q end_ARG italic_σ ( italic_x ) , italic_μ ( italic_x ) + over^ start_ARG italic_q end_ARG italic_σ ( italic_x ) }.

4 Neural Weather Models

A multitude of neural network based approaches have been successfully applied to weather forecasting (Lam et al., 2023; Bi et al., 2023; Pathak et al., 2022). Irrespective of the choice of architecture and training conditions, neural weather models operate to take in initial states 𝒳𝒳\mathcal{X}caligraphic_X and map to a forecast 𝒴𝒴\mathcal{Y}caligraphic_Y. Within the scope of this paper, we focus on the Hi-LAM model of Oskarsson et al. (2024). Hi-LAM is a graph-based neural weather prediction model (Keisler, 2022; Lam et al., 2023), where a hierarchical Graph Neural Network (GNN) is utilized for producing the forecast.

Let Xtsuperscript𝑋𝑡X^{t}italic_X start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT denote the full weather state at time step t𝑡titalic_t, including multiple atmospheric variables modelled for all grid cells in some discretisation. Examples of such atmospheric variables are temperature, wind, and solar radiation. The GNN g𝑔gitalic_g in Hi-LAM represents the single time step prediction

Xt+1=g(Xt1:t,Ft+1)superscript𝑋𝑡1𝑔superscript𝑋:𝑡1𝑡superscript𝐹𝑡1X^{t+1}=g\left(X^{t-1:t},F^{t+1}\right)italic_X start_POSTSUPERSCRIPT italic_t + 1 end_POSTSUPERSCRIPT = italic_g ( italic_X start_POSTSUPERSCRIPT italic_t - 1 : italic_t end_POSTSUPERSCRIPT , italic_F start_POSTSUPERSCRIPT italic_t + 1 end_POSTSUPERSCRIPT ) (3)

where Ft+1superscript𝐹𝑡1F^{t+1}italic_F start_POSTSUPERSCRIPT italic_t + 1 end_POSTSUPERSCRIPT are known forcing inputs that should not be predicted. Equation 3 can be applied iteratively to roll out a complete forecast of T𝑇Titalic_T time steps. The full forecasting model can thus be viewed as mapping from initial weather states 𝒳=X1:0𝒳superscript𝑋:10\mathcal{X}=X^{-1:0}caligraphic_X = italic_X start_POSTSUPERSCRIPT - 1 : 0 end_POSTSUPERSCRIPT and forcing F1:Tsuperscript𝐹:1𝑇F^{1:T}italic_F start_POSTSUPERSCRIPT 1 : italic_T end_POSTSUPERSCRIPT to a forecast 𝒴=X1:T𝒴superscript𝑋:1𝑇\mathcal{Y}=X^{1:T}caligraphic_Y = italic_X start_POSTSUPERSCRIPT 1 : italic_T end_POSTSUPERSCRIPT of shape Tout×Nx×Ny×Nvarsubscript𝑇outsubscript𝑁𝑥subscript𝑁𝑦subscript𝑁varT_{\text{out}}\times N_{x}\times N_{y}\times N_{\text{var}}italic_T start_POSTSUBSCRIPT out end_POSTSUBSCRIPT × italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_N start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT × italic_N start_POSTSUBSCRIPT var end_POSTSUBSCRIPT. As Hi-LAM is a limited area model, it produces weather forecasts for a specific sub-area of the globe. To achieve this, boundary condition along the edges of the forecasting area are given as forcing inputs.

The Hi-LAM models used in our experiments were trained on the original dataset of forecasts from the MetCoOp Ensemble Prediction System (MEPS) (Müller et al., 2017). Forecasts are produced for a limited area covering the Nordic region. One forecast includes Nvar=17subscript𝑁var17N_{\text{var}}=17italic_N start_POSTSUBSCRIPT var end_POSTSUBSCRIPT = 17 variables modelled on a Nx×Ny=238×268subscript𝑁𝑥subscript𝑁𝑦238268N_{x}\times N_{y}=238\times 268italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_N start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT = 238 × 268 grid over T=19𝑇19T=19italic_T = 19 time steps (up to 57 hour lead time). We refer to Oskarsson et al. (2024) for further details about the model and data.

We use two versions of Hi-LAM, trained with different loss functions to build a deterministic and a probabilistic model:

  • Hi-LAM (MSE): The exact model from Oskarsson et al. (2024), trained with a weighted MSE loss. This model outputs only a single prediction, to be interpreted as the mean of the weather state.

  • Hi-LAM (NLL): A version of Hi-LAM that outputs both the mean and standard deviation for each time, variable and grid cell. This model was trained with a Negative Log-Likelihood (NLL) loss, assuming a diagonal Gaussian predictive distribution (also called uncertainty loss (Chen et al., 2023)). Apart from the change of loss function, the training setup was unchanged.

We use forecasts from September 2021 as our calibration dataset and forecasts from September 2022 as test data. By using the same month for calibration and testing we minimize the effect of distributional shifts due to seasonal effects, and can work under the assumption of exchangeability. Having access to calibration data from the same month, collected the previous year, is a reasonable assumption in practical settings. For Hi-LAM (MSE) we compute non-conformity scores using the RES strategy and for the probabilistic Hi-LAM (NLL) we use STD scores.

(a)
Refer to caption
(b)
Refer to caption
Figure 2: Prediction (top), Ground Truth (middle) and width of the error bars (bottom) for predicting the temperature 2m above ground (2t) using Hi-LAM (MSE).
(a)
Refer to caption
Hi-LAM (MSE)
(b)
Refer to caption
Hi-LAM (NLL)
Figure 3: Width of predictive interval at α=0.05𝛼0.05\alpha=0.05italic_α = 0.05 for net shortwave solar radiation flux (nswrs).

4.1 Results

Figure 2 shows the ground truth, predicted forecast and the conformalised error intervals for temperature 2m above ground. Considering the autoregressive nature of Hi-LAM, the error accumulates and grows further in time, which is accurately captured by the CP framework (refer Figure 4).

We visualise the uncertainty for specific lead times by plotting the width of the error bars for all spatial locations in an example forecast. Such plots for shortwave solar radiation are shown in Figure 3 and for geopotential in Figure 4.In Appendix A we show plots for additional variables, including slice plots to help visualise the error bars obtained for different α𝛼\alphaitalic_α values using the CP framework.

Figure 3 highlights an important difference between the two methods for computing non-conformity scores. As the shortwave solar radiation is close to 0 during the night, it is easy for the model to predict. During the day this is far more challenging, with the solar radiation being filtered through sparse and dynamic cloud cover. With the RES non-conformity scores, used for Hi-LAM (MSE) in Figure 3, the width of the predictive intervals are determined during calibration, and does not change depending on the forecast from the model. As a specific lead time can fall both during day and night, depending on the initialization time, CP will give large error bars also during the night. This can be compared to the results for Hi-LAM (NLL) in Figure 3, using STD non-conformity scores. In this case the bounds are very tight for lead times during the night (33 h and 57 h). It can also be noted that for Hi-LAM (NLL) at lead time 15 h we see clear spatial features appearing in the error bars themselves. This corresponds to higher forecast uncertainty in areas of rapid change. The forecast-dependent patterns for Hi-LAM (NLL) thus have desirable properties, but this relies on having a model that outputs (potentially uncalibrated) standard-deviations.

We also evaluate the empirical coverage, the fraction of the data that lays within the conformalised error bars. From Equation 2 this should match the chosen value of 1α1𝛼1-\alpha1 - italic_α. For 1α=90%1𝛼percent901-\alpha=90\%1 - italic_α = 90 % we get coverage of 91.09%percent91.0991.09\%91.09 % for Hi-LAM (MSE) and 91.08%percent91.0891.08\%91.08 % for Hi-LAM (NLL). This can be compared to a coverage of 74.62%percent74.6274.62\%74.62 % for bounds computed directly from the uncalibrated Gaussians output by Hi-LAM (NLL). In Appendix A we plot the empirical coverage for a range of values of 1α1𝛼1-\alpha1 - italic_α. Overall the STD method applied to Hi-LAM (NLL) gives tighter bounds than RES for Hi-LAM (MSE) (0.96 vs 1.13 average width, in standardized units).

(a)
Refer to caption
Hi-LAM (MSE)
(b)
Refer to caption
Hi-LAM (NLL)
Figure 4: Width of predictive interval at α=0.05𝛼0.05\alpha=0.05italic_α = 0.05 for geopotential at 500 hPa (z500). Both models here show a certain spatial pattern, which can be attributed to how the GNN in Hi-LAM is defined over the forecasting area.

5 Discussion

Due to the chaotic nature of the weather system, capturing uncertainty in weather forecasts has long been an important consideration both in research and operations. Ensemble forecasting requires a computational cost proportional to the number of ensemble members. In contrast, CP offers a cheap method to immediately quantify forecast uncertainty for a time, position, and variable of interest, and can be applied to existing deterministic models. The CP framework does come with several limitations such as marginal coverage, lack of predictive distribution and the need for exchangeability. Though extensive work is being done to overcome these limitations, CP in its current form does pose significant utility in providing statistically valid error bars.

References

  • Angelopoulos & Bates (2021) Angelopoulos, A. N. and Bates, S. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv:2107.07511, 2021. doi: 10.48550/ARXIV.2107.07511. URL https://arxiv.org/abs/2107.07511.
  • Bi et al. (2023) Bi, K., Xie, L., Zhang, H., Chen, X., Gu, X., and Tian, Q. Accurate medium-range global weather forecasting with 3d neural networks. Nature, 619(7970):533–538, Jul 2023. ISSN 1476-4687. doi: 10.1038/s41586-023-06185-3. URL https://doi.org/10.1038/s41586-023-06185-3.
  • Bodnar et al. (2024) Bodnar, C., Bruinsma, W., Lucic, A., Stanley, M., Brandstetter, J., Garvan, P., Riechert, M., Weyn, J., Dong, H., Vaughan, A., Gupta, J., Tambiratnam, K., Archibald, A., Heider, E., Welling, M., Turner, R., and Perdikaris, P. Aurora: A foundation model of the atmosphere. Technical Report MSR-TR-2024-16, Microsoft Research AI for Science, May 2024. URL https://www.microsoft.com/en-us/research/publication/aurora-a-foundation-model-of-the-atmosphere/.
  • Buizza et al. (2008) Buizza, R., Leutbecher, M., and Isaksen, L. Potential use of an ensemble of analyses in the ecmwf ensemble prediction system. Quarterly Journal of the Royal Meteorological Society, 134(637):2051–2066, 2008. doi: https://doi.org/10.1002/qj.346. URL https://rmets.onlinelibrary.wiley.com/doi/abs/10.1002/qj.346.
  • Bülte et al. (2024) Bülte, C., Horat, N., Quinting, J., and Lerch, S. Uncertainty quantification for data-driven weather models, 2024.
  • Chen et al. (2023) Chen, K., Han, T., Gong, J., Bai, L., Ling, F., Luo, J.-J., Chen, X., Ma, L., Zhang, T., Su, R., Ci, Y., Li, B., Yang, X., and Ouyang, W. FengWu: Pushing the skillful global medium-range weather forecast beyond 10 days lead. arXiv preprint arXiv:2304.02948, 2023.
  • Graubner et al. (2022) Graubner, A., Kamyar Azizzadenesheli, K., Pathak, J., Mardani, M., Pritchard, M., Kashinath, K., and Anandkumar, A. Calibration of large neural weather models. In NeurIPS 2022 Workshop on Tackling Climate Change with Machine Learning, 2022.
  • Höhlein et al. (2024) Höhlein, K., Schulz, B., Westermann, R., and Lerch, S. Postprocessing of ensemble weather forecasts using permutation-invariant neural networks. Artificial Intelligence for the Earth Systems, 3(1), January 2024. ISSN 2769-7525. doi: 10.1175/aies-d-23-0070.1. URL http://dx.doi.org/10.1175/AIES-D-23-0070.1.
  • Keisler (2022) Keisler, R. Forecasting global weather with graph neural networks. arXiv preprint arXiv:2202.07575, 2022.
  • Lam et al. (2023) Lam, R., Sanchez-Gonzalez, A., Willson, M., Wirnsberger, P., Fortunato, M., Alet, F., Ravuri, S., Ewalds, T., Eaton-Rosen, Z., Hu, W., Merose, A., Hoyer, S., Holland, G., Vinyals, O., Stott, J., Pritzel, A., Mohamed, S., and Battaglia, P. Learning skillful medium-range global weather forecasting. Science, 382(6677):1416–1421, 2023. doi: 10.1126/science.adi2336. URL https://www.science.org/doi/abs/10.1126/science.adi2336.
  • Lei et al. (2018) Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R. J., and Wasserman, L. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111, 2018. doi: 10.1080/01621459.2017.1307116. URL https://doi.org/10.1080/01621459.2017.1307116.
  • Leutbecher & Palmer (2008) Leutbecher, M. and Palmer, T. Ensemble forecasting. Journal of Computational Physics, 227(7):3515–3539, 2008. ISSN 0021-9991. doi: https://doi.org/10.1016/j.jcp.2007.02.014. URL https://www.sciencedirect.com/science/article/pii/S0021999107000812. Predicting weather, climate and extreme events.
  • Ma et al. (2024) Ma, Z., Azizzadenesheli, K., and Anandkumar, A. Calibrated uncertainty quantification for operator learning via conformal prediction. arXiv preprint arXiv:2402.01960, 2024.
  • Müller et al. (2017) Müller, M., Homleid, M., Ivarsson, K.-I., Køltzow, M. A. Ø., Lindskog, M., Midtbø, K. H., Andrae, U., Aspelien, T., Berggren, L., Bjørge, D., Dahlgren, P., Kristiansen, J., Randriamampianina, R., Ridal, M., and Vignes, O. AROME-MetCoOp: A nordic convective-scale operational weather prediction model. Weather and Forecasting, 2017.
  • Oskarsson et al. (2024) Oskarsson, J., Landelius, T., Deisenroth, M. P., and Lindsten, F. Probabilistic weather forecasting with hierarchical graph neural networks. arXiv preprint arXiv:2406.04759, 2024.
  • Palmer et al. (2009) Palmer, T., Buizza, R., Doblas-Reyes, F., Jung, T., Leutbecher, M., Shutts, G., Steinheimer, M., and Weisheimer, A. Stochastic parametrization and model uncertainty, 10/2009 2009. URL https://www.ecmwf.int/node/11577.
  • Papadopoulos (2008) Papadopoulos, H. Inductive conformal prediction: Theory and application to neural networks. In Fritzsche, P. (ed.), Tools in Artificial Intelligence, chapter 18. IntechOpen, Rijeka, 2008. doi: 10.5772/6078.
  • Pathak et al. (2022) Pathak, J., Subramanian, S., Harrington, P., Raja, S., Chattopadhyay, A., Mardani, M., Kurth, T., Hall, D., Li, Z., Azizzadenesheli, K., Hassanzadeh, P., Kashinath, K., and Anandkumar, A. Fourcastnet: A global data-driven high-resolution weather model using adaptive fourier neural operators. arXiv:2202.11214, 2022.
  • Price et al. (2023) Price, I., Sanchez-Gonzalez, A., Alet, F., Ewalds, T., El-Kadi, A., Stott, J., Mohamed, S., Battaglia, P., Lam, R., and Willson, M. Gencast: Diffusion-based ensemble forecasting for medium-range weather. arXiv preprint arXiv:2312.15796, 2023.
  • Scher & Messori (2021) Scher, S. and Messori, G. Ensemble methods for neural network-based weather forecasts. Journal of Advances in Modeling Earth Systems, 2021.
  • Shafer & Vovk (2008) Shafer, G. and Vovk, V. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(3), 2008.
  • Sun (2022) Sun, S. Conformal methods for quantifying uncertainty in spatiotemporal data: A survey, 2022.
  • Vovk (2012) Vovk, V. Conditional validity of inductive conformal predictors. In Asian Conference on Machine Learning, 2012.
  • Vovk et al. (2005) Vovk, V., Gammerman, A., and Shafer, G. Algorithmic Learning in a Random World. Springer, 2005.
  • Xu et al. (2023) Xu, C., Xie, Y., Vazquez, D. A. Z., Yao, R., and Qiu, F. Spatio-temporal wildfire prediction using multi-modal data. IEEE Journal on Selected Areas in Information Theory, 4:302–313, 2023. doi: 10.1109/JSAIT.2023.3276054.

Appendix A Additional Results

Figures 5 to 8 show examples of predictive intervals at α=0.05𝛼0.05\alpha=0.05italic_α = 0.05 for additional variables from the models. In figure 9 we plot the empirical coverage for both models and different values of 1α1𝛼1-\alpha1 - italic_α. In Figure 10, we show slice plots along the spatial domain to further highlight the error bars obtained at various coverage levels using the CP framework.

(a)
Refer to caption
Hi-LAM (MSE)
(b)
Refer to caption
Hi-LAM (NLL)
Figure 5: Width of predictive interval at α=0.05𝛼0.05\alpha=0.05italic_α = 0.05 for relative humidity at 2 m above ground (2r).
(a)
Refer to caption
Hi-LAM (MSE)
(b)
Refer to caption
Hi-LAM (NLL)
Figure 6: Width of predictive interval at α=0.05𝛼0.05\alpha=0.05italic_α = 0.05 for water vapour, integrated over the column above the grid cell (wvint).
(a)
Refer to caption
Hi-LAM (MSE)
(b)
Refer to caption
Hi-LAM (NLL)
Figure 7: Width of predictive interval at α=0.05𝛼0.05\alpha=0.05italic_α = 0.05 for u-component of wind vector at model level 65 (u65). Level 65 is the bottom level modelled in MEPS, and sits at approximately 12.5 m above ground.
(a)
Refer to caption
Hi-LAM (MSE)
(b)
Refer to caption
Hi-LAM (NLL)
Figure 8: Width of predictive interval at α=0.05𝛼0.05\alpha=0.05italic_α = 0.05 for u-component of wind vector at model level 65 (v65). Level 65 is the bottom level modelled in MEPS, and sits at approximately 12.5 m above ground.
Refer to caption
(a) Hi-LAM (MSE)
Refer to caption
(b) Hi-LAM (NLL)
Figure 9: Empirical Coverage at different levels of 1α1𝛼1-\alpha1 - italic_α for Hi-LAM (MSE) and Hi-LAM (NLL).
Refer to caption
(a) Hi-LAM (MSE), α=0.05𝛼0.05\alpha=0.05italic_α = 0.05
Refer to caption
(b) Hi-LAM (MSE), α=0.85𝛼0.85\alpha=0.85italic_α = 0.85
Refer to caption
(c) Hi-LAM (NLL), α=0.05𝛼0.05\alpha=0.05italic_α = 0.05
Refer to caption
(d) Hi-LAM (NLL), α=0.85𝛼0.85\alpha=0.85italic_α = 0.85
Figure 10: Slice plots across the x𝑥xitalic_x-axis of a temporal prediction of a single variable (u-component of wind). Figures (a) - (d) depicts the ground truth, prediction, upper and lower bars obtained through the CP framework for the Hi-LAM (MSE) and Hi-LAM (NLL) for 95 percent coverage (α=0.05𝛼0.05\alpha=0.05italic_α = 0.05) and 15 percent coverage (α=0.85𝛼0.85\alpha=0.85italic_α = 0.85).