Prediction of Sentinel-2 multi-band imagery with attention BiLSTM for continuous earth surface monitoring

Abstract

Continuous monitoring of crops and forecasting crop conditions through time series analysis is crucial for effective agricultural management. This study proposes a framework based on an attention Bidirectional Long Short-Term Memory (BiLSTM) network for predicting multiband images. Our model can forecast target images on user-defined dates, including future dates and periods characterized by persistent cloud cover. By focusing on short sequences within a sequence-to-one forecasting framework, the model leverages advanced attention mechanisms to enhance prediction accuracy. Our experimental results demonstrate the model’s superior performance in predicting NDVI, multiple vegetation indices, and all Sentinel-2 bands, highlighting its potential for improving remote sensing data continuity and reliability.

1 Introduction

Accurately predicting missing bands in remote sensing data is critical for continuous monitoring and analysis of Earth’s surface processes. Traditional time series interpolation methods are commonly used for reconstructing historical missing images. However, these methods often struggle with data quality issues, such as cloud cover, that can obscure critical data, especially in optical satellite sensors [1]. Recent studies also utilize Synthetic Aperture Radar (SAR) signals, which are less affected by cloud cover, to impute missing images due to their robustness in adverse weather conditions [2, 3, 4]. The Gaussian process has also been recognized as a potent method for imputing bands in time series data by capturing the spatial coherence and temporal regularity inherent in remote sensing data [5].

The integration of attention mechanisms in LSTM networks has shown promising results in various studies [6, 7, 8]. These mechanisms focus the model on relevant parts of the input data, improving the accuracy of predictions in complex datasets characterized by missing or incomplete information [9, 10]. For instance, Gerber et al. (2018) [1] highlighted the potential of spatio-temporal prediction methods to fill gaps in datasets with extensive missing values due to cloud cover and other artefacts.

Our proposed framework utilizes a BiLSTM network with an attention mechanism, enhancing the model’s ability to handle short sequences in a sequence-to-one forecasting framework. This approach builds on the demonstrated effectiveness of LSTM networks in handling temporal data gaps by learning from both past and future contexts, which is particularly advantageous for predicting cloud-free images acquired in user-defined time.

2 Method

Refer to caption — Fig. 1: Flowchart of the vanilla attention BiLSTM. The processing flow in this figure takes 5 time series bands as an example. The additional times series, which is highlighted in red, is the time difference time series.

In designing and implementing our neural network architecture (Fig.1), we integrated advanced deep learning techniques to address complex sequence learning tasks. The model commences with an input layer configured to accommodate the dimensions of the input sequence, specifically tuned to the time steps and feature counts of the dataset in use. This is followed by a series of LSTM layers, where the initial layer is bidirectional, enabling the network to learn from the sequence data in both forward and reverse directions. Each LSTM layer is accompanied by batch normalization and ReLU activation functions, which are critical for stabilizing the neural network by normalizing the activations and introducing non-linearity, respectively.

Furthermore, we incorporate an attention mechanism, which selectively weighs the importance of different time steps across the input sequence. This is achieved through a dense layer with softmax activation that outputs attention probabilities. These probabilities are then element-wise multiplied with the LSTM outputs to emphasize features of higher relevance to the task at hand. Post-attention application, the model leverages a Global Average Pooling 1D layer to condense the feature map across time steps into a single vector, reducing the model’s complexity and computational demands while retaining essential temporal features. The output layer, constructed with a dense layer, is tailored with an adjustable number of neurons corresponding to the desired outputs and utilizes a linear activation function to produce the final prediction.

The methodological choices in constructing this neural network, from the layered architecture to the incorporation of attention and advanced regularization techniques, are designed to optimize performance for predicting multi-dimensional outputs from sequential data. This comprehensive approach not only ensures robust learning capabilities but also adapts effectively to various types of sequence-based prediction tasks.

3 Experimental results and discussion

3.1 Dataset

To compare the proposed method with other LSTM-based methods, we tested them with 3 databases prepared using selected multitemporal cloud-free Sentinel-2 images. Only the areas acquired over vineyards are used to prepare the database. The time steps are 5, and the channels all contain the time difference channel. The time difference (Fig.2) is computed between the target acquisition time and the previous 5 band acquisition times. It can be defined by the user accounting for the specific requirement. The mean and standard deviation of different Sentinel-2 bands and vegetation indices computed with the images acquired in a vineyard are introduced in Fig.4.

Methods	RMSE-NDVI	MAPE-NDVI	RMSE-VIs	MAPE-VIs	RMSE-s2	MAPE-S2
attention BiLSTM	0.0323	4.796	0.0281	5.348	0.0174	6.690
attention LSTM	0.0315	4.469	0.0313	5.545	0.0169	7.403
BiLSTM	0.0329	5.174	0.0318	5.944	0.0183	7.3700
attention ConvLSTM2D	0.0333	5.2096	0.0302	5.576	0.0176	7.991
ConvLSTM2D	0.0320	4.854	0.0313	5.474	0.0179	7.235

Table 1: Performance comparison of different methods. The best results are in bold and the second best are underlined

3.2 NDVI forecasting

The performance metrics in NDVI forecasting, as shown in Tab.1, reveal that the attention LSTM model achieved the best performance with the lowest RMSE (0.0315) and MAPE (4.469), indicating a robust capability to model and predict NDVI accurately. This outcome suggests that LSTM with attention mechanisms effectively captures temporal dynamics crucial for NDVI estimation. The attention BiLSTM provides the second best MAPE result, and one test result over unseen data can be seen in Fig.3. The ConvLSTM2D model, despite not using an attention mechanism, also performed commendably, achieving the second-best RMSE (0.0320) and a competitive MAPE (4.854). This section underscores the importance of both LSTM-based models and attention approaches in handling the temporal-spatial complexity of NDVI data from satellite imagery.

3.3 Multi vegetation indices forecasting

In forecasting multiple vegetation indices, the attention BiLSTM model outperformed other architectures, achieving the best RMSE (0.0281) and MAPE (5.348). One forecasting example can be found in Fig.5. These results suggest that the bidirectional processing of LSTM layers, combined with an attention mechanism, provides a significant advantage in handling the complexities of multiple vegetation indices. Notably, the attention ConvLSTM2D model also showed strong performance with the second-best RMSE (0.0302), indicating the effectiveness of integrating convolutional layers for spatial feature extraction alongside LSTM’s temporal analysis.

3.4 Sentinel-2 bands forecasting

For the Sentinel-2 bands forecasting, the attention BiLSTM model again stands out, recording the best MAPE (6.690) and a highly competitive RMSE (0.0174), underscoring its effectiveness in processing multi-spectral satellite data. The attention LSTM model provided the lowest RMSE (0.0169), although with a slightly higher MAPE, highlighting the trade-offs between different error metrics and the model’s sensitivity to outlier predictions. From Fig.4, we can see that Sentinel-2 time series bands have larger standard deviations than Sentinel-2 vegetation indices. It highlighted the robustness of the model performances. These comparisons show how different architectures manage the high-dimensional nature of Sentinel-2 data, with attention mechanisms improving the model’s focus on relevant spectral bands.

4 Conclusion

This study successfully demonstrated the application of an attention BiLSTM network for the prediction of multiband remote sensing images, aimed at enhancing crop monitoring and forecasting under varying cloud cover conditions. The experimental results underscore the model’s potential to provide reliable data continuity in remote sensing, particularly in regions and seasons where cloud cover frequently obscures the earth’s surface. The ability of our model to forecast images on user-defined dates, including during adverse weather conditions, represents a substantial advancement in the field of agricultural management and environmental monitoring.

Future research will aim to extend this work by focusing on the prediction of missing multiband images over entire seasons, further enhancing the model’s ability to handle diverse and complex environmental conditions.

References

[1] Florian Gerber, Rogier de Jong, Michael E Schaepman, Gabriela Schaepman-Strub, and Reinhard Furrer, “Predicting missing values in spatio-temporal remote sensing data,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 5, pp. 2841–2853, 2018.
[2] Dipankar Mandal, Mehdi Hosseini, Heather McNairn, Vineet Kumar, Avik Bhattacharya, YS Rao, Scott Mitchell, Laura Dingle Robertson, Andrew Davidson, and Katarzyna Dabrowska-Zielinska, “An investigation of inversion methodologies to retrieve the leaf area index of corn from c-band sar data,” International Journal of Applied Earth Observation and Geoinformation, vol. 82, pp. 101893, 2019.
[3] W Zhao, F Yin, H Ma, Q Wu, J Gomez-Dans, and P Lewis, “Combining multitemporal optical and sar data for lai imputation with bilstm network,” arXiv preprint arXiv:2307.07434, 2023.
[4] Thomas Roßberg and Michael Schmitt, “Dense ndvi time series by fusion of optical and sar-derived data,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2024.
[5] Luca Pipia, Jordi Muñoz-Marí, Eatidal Amin, Santiago Belda, Gustau Camps-Valls, and Jochem Verrelst, “Fusing optical and sar time series for lai gap filling with multioutput gaussian processes,” Remote Sensing of Environment, vol. 235, pp. 111452, 2019.
[6] Baili Chen, Hongwei Zheng, Lili Wang, Olaf Hellwich, Chunbo Chen, Liao Yang, Tie Liu, Geping Luo, Anming Bao, and Xi Chen, “A joint learning im-bilstm model for incomplete time-series sentinel-2a data imputation and crop classification,” International Journal of Applied Earth Observation and Geoinformation, vol. 108, pp. 102762, 2022.
[7] Qiang Zhang, Ruiqi Wang, Ying Qi, and Fei Wen, “A watershed water quality prediction model based on attention mechanism and bi-lstm,” Environmental Science and Pollution Research, vol. 29, no. 50, pp. 75664–75680, 2022.
[8] Lynn Miller, Charlotte Pelletier, and Geoffrey I Webb, “Deep learning for satellite image time-series analysis: A review,” IEEE Geoscience and Remote Sensing Magazine, 2024.
[9] Qunming Wang, Yijie Tang, Yong Ge, Huan Xie, Xiaohua Tong, and Peter M Atkinson, “A comprehensive review of spatial-temporal-spectral information reconstruction techniques,” Science of Remote Sensing, p. 100102, 2023.
[10] Heshan Wang, Yiping Zhang, Jing Liang, and Lili Liu, “Dafa-bilstm: Deep autoregression feature augmented bidirectional lstm network for time series prediction,” Neural Networks, vol. 157, pp. 240–256, 2023.