This study examined aborted landings caused by wind shear at Hong Kong International Airport (HKIA). An analysis of PIREP data from HKIA covering 1 January 2015 to 23 July 2023 revealed a total of 3585 wind shear events affecting both arriving and departing flights. Of these, our research specifically examined the 2024 cases reported by flights arriving at HKIA. In these 2024 wind shear incidents, there were 476 aborted landings and 1552 successful landings. Standard protocols for data preparation and pre-processing were followed in this study [37]. The data were randomly divided using a randomly chosen seed, with 70% allocated to training and the remaining 30% to testing, and all deep learning models were evaluated on this split.
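The seeded 70/30 split described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used in the study: the function name `train_test_split_indices`, the seed value 42, and the choice to round the test size up are all assumptions.

```python
import math
import random

def train_test_split_indices(n_samples, test_fraction=0.30, seed=42):
    """Return (train_idx, test_idx) for a reproducible random split.

    The seed value and the rounding rule are assumptions for illustration.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # seeded shuffle: same split every run
    n_test = math.ceil(n_samples * test_fraction)  # round the test size up
    return indices[n_test:], indices[:n_test]

# Aborted and successful landing counts as reported in the text.
n_samples = 476 + 1552
train_idx, test_idx = train_test_split_indices(n_samples)
print(len(train_idx), len(test_idx))
```

Fixing the seed matters here because every model is compared on the same held-out instances, so differences in the metrics reflect the models and data treatments rather than the partition.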
A binary classification problem was established, designating successful landings during wind shear as the majority class and aborted landings as the minority class. To address the class imbalance, three data augmentation techniques (SMOTE, KMeans-SMOTE, and Borderline-SMOTE) were used to balance the training data. After treatment, the training dataset consisted of 1093 successful and 1093 aborted landing instances, as depicted in Figure 8. The research was conducted in a Jupyter Notebook environment, using custom-written Python code that leveraged libraries such as Pandas, NumPy, sklearn, TensorFlow, and DeepTables.
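To make the balancing step concrete, the sketch below shows the interpolation idea behind plain SMOTE: each synthetic minority sample is placed on the line segment between a minority point and one of its nearest minority neighbours. It is a simplified, standard-library illustration on toy data, not the implementation used in the study; the function name `smote`, the toy points, and the parameter values are assumptions. KMeans-SMOTE and Borderline-SMOTE differ mainly in how the base points are selected and are not covered here.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples by SMOTE-style interpolation.

    Requires at least two minority samples. Illustrative sketch only.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy data (hypothetical): 4 minority points versus a majority class of 10.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
majority_count = 10
synthetic = smote(minority, n_new=majority_count - len(minority))
print(len(minority) + len(synthetic))  # minority class size after balancing
```

Resampling of this kind is applied to the training data only, so the test set keeps the original class proportions.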
For the conventional DNN and WDN models, the key hyperparameters were the number of hidden layers, the number of neurons in each layer, the type of activation function, the training algorithm (optimizer), and the learning rate. The activation function and optimizer, being non-numeric, were encoded as numerical proxies. Additionally, a uniform dropout rate of 0.1 was applied across all hidden layers in the networks. For the DCN model, an extra hyperparameter, the number of cross layers, was also considered. Table A1, Table A2, and Table A3 in Appendix A display the ranges and optimal values of these hyperparameters as determined by Bayesian optimization under the various data treatment strategies.
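One way to read the "numerical proxies" mentioned above is that each categorical hyperparameter is searched as a number and then mapped back to a category by rounding to an index. The sketch below illustrates that decoding step only. The candidate lists, the key names, and the example search point are hypothetical and are not taken from Appendix A; the fixed dropout rate of 0.1 is the only value drawn from the text.

```python
ACTIVATIONS = ["relu", "tanh", "sigmoid"]   # hypothetical candidate list
OPTIMIZERS = ["adam", "sgd", "rmsprop"]     # hypothetical candidate list

def decode(point):
    """Map a numeric search point to a concrete hyperparameter configuration."""
    def pick(options, value):
        # Round the proxy to the nearest index and clamp it to the list bounds.
        index = min(max(int(round(value)), 0), len(options) - 1)
        return options[index]

    return {
        "hidden_layers": int(round(point["hidden_layers"])),
        "neurons": int(round(point["neurons"])),
        "activation": pick(ACTIVATIONS, point["activation"]),
        "optimizer": pick(OPTIMIZERS, point["optimizer"]),
        "learning_rate": point["learning_rate"],
        "dropout": 0.1,  # uniform dropout rate stated in the text
    }

# A hypothetical point proposed by the optimizer in the numeric search space.
config = decode({"hidden_layers": 2.7, "neurons": 63.2, "activation": 0.4,
                 "optimizer": 1.6, "learning_rate": 0.003})
print(config)
```

With this encoding, the optimizer only ever sees a numeric vector, and the model-building code only ever sees valid categories.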
4.1. Performance Analysis and Comparison
After the optimal hyperparameter combination had been determined, the deep learning models were retrained on the established training dataset. The validation loss was monitored and early stopping was employed to mitigate the risk of over-fitting. Confusion matrices were developed for the proposed deep learning models using both the original untreated data and the resampled data. The deep learning models performed poorly when applied to the imbalanced dataset, as shown in Figure 9. Among the 150 aborted landing instances in the testing dataset, the DNN correctly classified 8 instances, the DCN correctly classified 30 instances, and the WDN correctly classified 7 instances. The performance metrics of the proposed deep learning models, when applied to untreated data, are presented in Table 2. With untreated PIREP data, the DNN, DCN, and WDN models achieved low F1-scores of 9.88%, 29.27%, and 8.33%, respectively. The MCC values were also low, measuring 0.138, 0.218, and 0.057 for DNN, DCN, and WDN, respectively.
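For reference, both metrics follow directly from the four confusion-matrix counts. The sketch below computes them with the standard definitions. In the example call, the 8 true positives and 142 false negatives come from the DNN result stated above (8 of 150 aborted landings correctly classified), whereas the false-positive and true-negative counts are assumed values chosen for illustration and are not taken from the paper.

```python
import math

def f1_and_mcc(tp, fp, fn, tn):
    """Return (F1-score, Matthews correlation coefficient) from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc

# tp and fn follow from the text; fp and tn are ASSUMED for illustration.
f1, mcc = f1_and_mcc(tp=8, fp=4, fn=142, tn=455)
print(f"F1 = {100 * f1:.2f}%, MCC = {mcc:.3f}")
```

Under these assumed counts the overall accuracy would be about 76%, yet the F1-score stays below 10%, which illustrates why accuracy alone is not informative for the minority class on an imbalanced test set.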
Confusion matrices, as shown in Figure 10, were subsequently constructed for the conventional DNN model using training data treated with SMOTE, KMeans-SMOTE, and Borderline-SMOTE. Data treatment led to significant improvements in the correct classification of aborted landings. DNN + SMOTE correctly classified 116 of the 150 aborted landings, DNN + KMeans-SMOTE correctly classified 118, and DNN + Borderline-SMOTE correctly classified 128. The findings shown in Table 3 indicate that the DNN + SMOTE, DNN + KMeans-SMOTE, and DNN + Borderline-SMOTE models achieved higher F1-scores of 69.05%, 71.73%, and 77.34%, respectively. The MCC values were also higher than those obtained with untreated data, measuring 0.583, 0.618, and 0.701, respectively.
Confusion matrices were also developed for the DCN model using SMOTE, KMeans-SMOTE, and Borderline-SMOTE, as shown in Figure 11. For the DCN, data treatment likewise produced notable improvements in the correct classification of aborted landings. DCN + SMOTE correctly classified 127 of the 150 aborted landings, DCN + KMeans-SMOTE correctly classified 134, and DCN + Borderline-SMOTE correctly classified 142. The results presented in Table 4 demonstrate that the DCN + SMOTE, DCN + KMeans-SMOTE, and DCN + Borderline-SMOTE models yielded higher F1-scores of 73.41%, 76.15%, and 82.56%, respectively. The MCC values were also higher, measuring 0.642, 0.686, and 0.773, respectively.
Confusion matrices were also developed for the WDN models, as shown in Figure 12. The findings displayed in Table 5 indicate that the WDN + SMOTE, WDN + KMeans-SMOTE, and WDN + Borderline-SMOTE models exhibited F1-scores of 68.48%, 74.18%, and 78.75%, respectively, and MCC values of 0.576, 0.657, and 0.719, respectively.
Based on the above findings, the DCN + Borderline-SMOTE and WDN + Borderline-SMOTE models yielded the highest F1-scores, 82.56% and 78.75%, respectively. These models also achieved the highest MCC values, 0.773 and 0.719, respectively. In addition, the optimal deep learning models were compared to binary logistic regression (BLR) using both untreated and treated data. Both the F1-score and the MCC obtained from BLR, with untreated as well as treated data, were much lower than those obtained for the optimal deep learning models, as shown in Table 6. Because the results of the two optimal deep learning models were close, SHAP analysis was used to interpret both models, as detailed in the following section.
4.2. Interpretation of Optimal Deep Learning Models
Developing an accurate deep learning model for aborted landings is important because an optimized model can provide a deeper understanding of the relationship between aborted landings and the factors that contribute to them. Following the predictive analysis, SHAP bee swarm plots [40] were generated for both the optimal DCN + Borderline-SMOTE and WDN + Borderline-SMOTE models in order to evaluate the significance and contribution of the various factors. As depicted in Figure 13, the input factors are arranged on the vertical axis in descending order of influence, commencing with the factor exerting the greatest influence. The plot illustrates the contribution of these factors, with the SHAP value represented on the horizontal axis and a color scale ranging from blue (indicating a low value of the factor) to red (indicating a high value of the factor).
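The SHAP values plotted on the horizontal axis are estimates of Shapley values, which attribute the difference between a prediction and a baseline prediction to the individual input factors. For a model with only a few inputs they can be computed exactly by enumerating all coalitions of factors, as in the sketch below. The toy scoring function, the input vector, and the all-zero baseline are hypothetical stand-ins, not the trained DCN or WDN, and practical SHAP tooling relies on approximations rather than this exhaustive enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline)."""
    n = len(x)

    def value(coalition):
        # Factors in the coalition take their actual value, the rest the baseline.
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

# Hypothetical three-factor score with one interaction term (illustration only).
def toy_model(v):
    return 3 * v[0] - 2 * v[1] + v[0] * v[2]

x, baseline = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The attributions always sum to the difference between the prediction and the baseline prediction, which is what allows the per-factor contributions in a bee swarm plot to be read on a common scale.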
For both optimal deep learning models, the three most significant factors were the intensity of wind shear, the assigned approach runway, and the vertical distance of wind shear from the runway. This suggests that, although the performance of these deep learning models varies slightly, each may possess distinct advantages in different scenarios. For the intensity of wind shear, the blue dots are positioned to the right of the vertical reference line on the SHAP bee swarm plot. This positioning suggests a significant impact of negative wind shear magnitude, that is, of tail wind shear, on aborted landings during wind shear events. Similarly, runways at HKIA that are assigned lower codes are associated with a greater impact on the occurrence of aborted landings. Southerly or southeasterly gusts at HKIA increase the probability of wind shear, potentially resulting in a notable number of aborted landings at runway 07R. Aborted landings were additionally affected by wind shear events occurring at lower altitudes. The results of the factor importance and contribution analysis presented in this study are consistent with previous research conducted by others [41,42,43].
Furthermore, SHAP interaction plots were developed to analyze the three most significant factors. Figure 14a illustrates the interaction between the intensity of wind shear and the vertical distance of wind shear from the runway. Red and blue dots positioned above the horizontal green dashed reference line signify a significant effect exerted by the respective factor. It can be observed that the combined influence of tail wind shear, as indicated by negative values, and a low altitude of wind shear, as indicated by blue dots, often results in aborted landings. By contrast, head wind shear does not have any substantial influence, and wind shear at high altitudes does not have a noteworthy effect on aborted landings. Figure 14b indicates that there is a higher probability of aborted landings due to tail wind shear at runway 07R, whereas no noteworthy instances of aborted landings were recorded on other runways. Figure 14c shows a notable concentration of purple dots, symbolizing runway 07R, in the region above the horizontal green dashed reference line and below an altitude of 700 ft, indicating a higher occurrence of aborted landings. Based on these findings, it may be inferred that there is a higher probability of aborted landings for aircraft at runway 07R when wind shear occurs at altitudes below 700 ft.