15.3 Explanatory variables selection
There are different approaches to automatic variable selection, but not all of them are efficient in the context of dynamic models. For example, conventional stepwise approaches might either be infeasible in the case of small samples or take too much time to converge to an optimal solution (they have polynomial computational time). This well-known problem from the regression context is magnified in the context of dynamic models, because each model fit takes much more time than in the case of regression: the ADAMX needs to be refitted and re-estimated repeatedly using the recursive relations of the state space model (10.4). So, some simplifications are needed to make variable selection in ADAMX doable in a reasonable time.
To make the mechanism efficient in a limited time, I propose using the approach of Sagaert and Svetunkov (2022): the stepwise trace forward selection of variables. It uses partial correlations between variables to identify which of them to include in each iteration. While it has linear computational time instead of polynomial, doing the selection on the complete ADAMX would still take a lot of time because of the repeated fitting of the dynamic model. So, one of the possible solutions is to do variable selection in ADAMX based on the residuals of the model, in the following steps:
1. Estimate and fit the ADAM;
2. Extract the residuals of the ADAM;
3. Select the most suitable variables explaining the residuals, based on the trace forward stepwise approach and the selected information criterion;
4. Re-estimate the ADAMX with the selected explanatory variables.
The residuals in step (2) might vary from model to model, depending on the type of the error term and the selected distribution:
- Normal, Laplace, S, Generalised Normal or Asymmetric Laplace: \(e_t\);
- Additive error and Log-Normal, Inverse Gaussian or Gamma: \(\left(1+\frac{e_t}{\hat{y}_t} \right)\);
- Multiplicative error and Log-Normal, Inverse Gaussian or Gamma: \(1+e_t\).
So, the extracted residuals should be aligned with the distributional assumptions of each model.
In R, step (3) is done using the stepwise() function from the greybox package, which supports all the distributions implemented in ADAM. The only thing that needs to be modified is the number of degrees of freedom: the function should take into account all the estimated parameters, including those of the dynamic part of the model. This is done internally via the df parameter of stepwise().
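To demystify the mechanism, here is a rough sketch of the procedure applied to the Seatbelts data used later in this section. This is only an illustration of the logic, not a reproduction of what adam() does internally, and the object names are mine:

library(smooth)    # adam()
library(greybox)   # stepwise(), actuals(), nparam()

# Step 1: estimate the dynamic model without explanatory variables
etsModel <- adam(Seatbelts[, "drivers"], "MNM")

# Step 2: extract the residuals. With the multiplicative error and the
# Gamma distribution, the relevant quantity is 1 + e_t, which equals the
# ratio of the actual values to the fitted ones
etsResiduals <- actuals(etsModel) / fitted(etsModel)

# Step 3: select the variables explaining the residuals. The df parameter
# makes stepwise() account for the parameters of the dynamic part
selectionData <- data.frame(residuals = etsResiduals,
                            Seatbelts[, c("kms", "PetrolPrice", "law")])
stepModel <- stepwise(selectionData, distribution = "dgamma",
                      df = nparam(etsModel))

# Step 4: re-estimate the ADAMX with the selected explanatory variables
selectedVariables <- all.vars(formula(stepModel))[-1]
adamManual <- adam(Seatbelts[, c("drivers", selectedVariables)], "MNM")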
While the suggested approach has obvious limitations (e.g. the smoothing parameters can become higher than needed, absorbing some of the variability that the explanatory variables could have explained), it is efficient in terms of computational time.
To see how it works, we use the Seatbelts data:
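The preparation of the data is not repeated in this section; assuming the same subset of variables as in Section 10.6, it can be reconstructed as:

# Use drivers as the response variable and the rest as explanatory ones
SeatbeltsData <- Seatbelts[, c("drivers", "kms", "PetrolPrice", "law")]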
We have already had a look at this data earlier in Section 10.6, so we can move directly to the selection part:
adamETSXMNMSelectSeat <- adam(SeatbeltsData, "MNM",
                              h=12, holdout=TRUE,
                              regressors="select")
summary(adamETSXMNMSelectSeat)
## Warning: Observed Fisher Information is not positive semi-definite, which means
## that the likelihood was not maximised properly. Consider reestimating the
## model, tuning the optimiser or using bootstrap via bootstrap=TRUE.
##
## Model estimated using adam() function: ETSX(MNM)
## Response variable: drivers
## Distribution used in the estimation: Gamma
## Loss function type: likelihood; Loss function value: 1125.107
## Coefficients:
## Estimate Std. Error Lower 2.5% Upper 97.5%
## alpha 0.1968 0.0559 0.0864 0.3070 *
## gamma 0.0891 0.0673 0.0000 0.2218
## level 1670.0894 73.8129 1524.3433 1815.6589 *
## seasonal_1 1.0138 0.0320 0.9648 1.0892 *
## seasonal_2 0.9119 0.0248 0.8629 0.9872 *
## seasonal_3 0.9295 0.0294 0.8805 1.0049 *
## seasonal_4 0.8729 0.0271 0.8240 0.9483 *
## seasonal_5 0.9457 0.0293 0.8967 1.0211 *
## seasonal_6 0.9153 0.0289 0.8664 0.9907 *
## seasonal_7 0.9647 0.0301 0.9157 1.0401 *
## seasonal_8 0.9752 0.0307 0.9263 1.0506 *
## seasonal_9 0.9939 0.0280 0.9450 1.0693 *
## seasonal_10 1.0717 0.0289 1.0227 1.1470 *
## seasonal_11 1.2118 0.0382 1.1629 1.2872 *
## law 0.0200 0.0949 -0.1674 0.2073
##
## Error standard deviation: 0.0783
## Sample size: 180
## Number of estimated parameters: 16
## Number of degrees of freedom: 164
## Information criteria:
## AIC AICc BIC BICc
## 2282.213 2285.551 2333.300 2341.966
Remark. The summary() method might complain about the observed Fisher Information. This only means that the estimated variances of parameters might be lower than they should be in reality. This is discussed in Section 16.2.
Based on the summary of the model, we can see that neither kms nor PetrolPrice improve the model in terms of AICc (they were not included in the model). We could check them manually to see if the selection has worked out well in our case, constructing a sink regression (a model with all the available explanatory variables included) as a benchmark:
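The sink regression can be constructed using the same call as before, but with regressors="use", which includes all the provided explanatory variables without any selection (the object name adamETSXMNMSinkSeat is an assumption):

adamETSXMNMSinkSeat <- adam(SeatbeltsData, "MNM",
                            h=12, holdout=TRUE,
                            regressors="use")
summary(adamETSXMNMSinkSeat)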
## Warning: Observed Fisher Information is not positive semi-definite, which means
## that the likelihood was not maximised properly. Consider reestimating the
## model, tuning the optimiser or using bootstrap via bootstrap=TRUE.
##
## Model estimated using adam() function: ETSX(MNM)
## Response variable: drivers
## Distribution used in the estimation: Gamma
## Loss function type: likelihood; Loss function value: 1131.148
## Coefficients:
## Estimate Std. Error Lower 2.5% Upper 97.5%
## alpha 0.1890 0.0071 0.1751 0.2029 *
## gamma 0.0675 0.0405 0.0000 0.1473
## level 4089.8186 240.2183 3615.4557 4563.5334 *
## seasonal_1 0.9445 0.0466 0.8766 1.0670 *
## seasonal_2 0.8426 0.0434 0.7747 0.9651 *
## seasonal_3 0.9083 0.0509 0.8404 1.0308 *
## seasonal_4 0.8528 0.0481 0.7849 0.9753 *
## seasonal_5 0.9773 0.0522 0.9094 1.0998 *
## seasonal_6 0.9489 0.0517 0.8810 1.0714 *
## seasonal_7 1.0364 0.0438 0.9685 1.1589 *
## seasonal_8 1.0697 0.0344 1.0018 1.1922 *
## seasonal_9 1.0449 0.0598 0.9770 1.1674 *
## seasonal_10 1.0936 0.0621 1.0257 1.2161 *
## seasonal_11 1.1454 0.0544 1.0775 1.2679 *
## kms 0.0000 0.0000 0.0000 0.0000
## PetrolPrice 0.4248 1.0763 -1.7005 2.5472
## law 0.0191 0.0716 -0.1223 0.1603
##
## Error standard deviation: 0.082
## Sample size: 180
## Number of estimated parameters: 18
## Number of degrees of freedom: 162
## Information criteria:
## AIC AICc BIC BICc
## 2298.296 2302.544 2355.769 2366.800
We can see that the sink regression model has a higher AICc value than the model with the selected variables, which means that the latter is closer to the “true model”. While adamETSXMNMSelectSeat might not be the best possible model in terms of information criteria, it is still a reasonable one and can be used for further inference.
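For example, we could produce point forecasts and prediction intervals from it via the forecast() method from smooth. Because the model was estimated with holdout=TRUE, the holdout values of the regressors are kept in the object, so no newdata needs to be provided here:

# Produce and plot forecasts with prediction intervals
plot(forecast(adamETSXMNMSelectSeat, h=12, interval="prediction"))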