This paper explores spatial correlation discovery and mining for SST data from four aspects: regular boundary division for spatial interference elimination, convolutional sliding translation for spatial feature focusing, a clustering neural network for spatial feature extraction, and the graph convolutional neural network (GCN). For the GCN aspect, we build the graph convolutional neural network, construct the graph structure of the SST data, and, on this basis, design a spatiotemporal fusion prediction model (GCN-LSTM) that combines the GCN and LSTM to improve the accuracy of SST prediction.
2.2. Convolutional Sliding Translation for Spatial Feature Focusing
The convolution operation of the convolutional neural network is similar to the division mode in Section 2.1, and no explicit division operation is required. The convolutional neural network arranges the spatial information into convolutional windows and integrates and refines the data in each window through the sliding translation of the window, thereby focusing the spatial features and mining more feature information to improve the accuracy of SST prediction.
The two core concepts of the convolutional neural network are the convolution kernel K and the convolution step size S, each of which has a horizontal and a vertical size, defined as Kh, Kv, Sh, and Sv, respectively. The convolution kernel is the window of the convolution operation, and the convolution step indicates how the kernel window moves and the distance of each move. The shape of the SST data is set as (Ph, Pv, X), where Ph is the number of horizontal spatial points, Pv is the number of vertical spatial points, and X is the feature dimension of the SST data. The convolution operates on the two-dimensional plane (Ph, Pv).
With the convolution kernel and step size defined, the output size in the horizontal and vertical directions after the convolution operation can be calculated. The output size in the horizontal direction is given by Equation (1):

Oh = ⌊(Ph − Kh)/Sh⌋ + 1,    (1)

where Oh is the size of the horizontal output dimension, Ph is the number of horizontal spatial points in the selected sea area, Kh is the size of the horizontal direction of the convolution kernel, and Sh is the size of the horizontal convolution step. Similarly, the output size in the vertical direction is given by Equation (2):

Ov = ⌊(Pv − Kv)/Sv⌋ + 1.    (2)
Finally, the output shape of the SST data (Ph, Pv, X) after passing through the convolutional neural network is (Oh, Ov, F), where F is the number of convolution kernels. The convolutional neural network fully processes and explores the input data in the spatial dimension. Its output can then be fed into the LSTM model for further training in the temporal dimension on the basis of the spatial feature extraction, which is a powerful spatial complement to the LSTM. Thus, the accuracy of SST prediction is improved.
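As a quick numerical check of Equations (1) and (2), the output size along each axis can be computed as below. This is a minimal sketch assuming a convolution without padding; the 24 × 24 grid and 3 × 3 kernel are illustrative values, not the paper's actual configuration.

```python
def conv_output_size(p, k, s):
    """Output length along one axis for a convolution with no padding,
    following Equations (1) and (2): O = floor((P - K) / S) + 1."""
    return (p - k) // s + 1

# Example: a 24 x 24 grid of spatial points, a 3 x 3 kernel, stride 1
oh = conv_output_size(24, 3, 1)  # horizontal output size Oh
ov = conv_output_size(24, 3, 1)  # vertical output size Ov
```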
2.3. Spatial Feature Extraction by the Clustering Neural Network
Although the regular boundary division and convolutional sliding translation can play certain roles in improving SST prediction, they still have limitations because they only consider spatial points that are near each other, while similar rules may also exist among spatial points that are far apart. The clustering neural network can alleviate this problem. It analyzes and mines the data of the spatial points to extract their spatial features and divides the spatial points into different groups according to the captured rules and similarities. Clustering is an unsupervised training mode that does not need labels, which is in line with the need for a better method of mining spatial correlations.
Self-organizing mapping [44] (SOM) is a commonly used unsupervised neural network for clustering. Each input of the input layer maps to a node of the hidden layer, the output neurons compete with each other to be activated, and the neurons generate the final result in a self-organizing way, so it is called self-organizing mapping.
The SOM network consists of four main parts: initialization, competition, cooperation, and adaptation. Given an M-dimensional input X = {xi: i = 1, 2, …, M}, the connection weight between node i of the input layer and neuron j of the computing layer can be expressed as Wj = {wji: j = 1, 2, …, N}, where N is the number of neurons in the computational layer. The initialization step randomly initializes Wj to a relatively small connection weight tensor. The competition step then finds the neuron that best matches the input using the Euclidean distance discriminant function of Equation (3):

dj(X) = Σ_{i=1..M} (xi − wji)².    (3)
The neuron whose weight tensor is closest to the input tensor is chosen as the winner. Once a neuron is selected, the probability of its neighboring neurons being selected is greatly increased. Denoting the index of the winning neuron by I(X), the topological neighbors of I(X) can be identified by Equation (4):

Tj,I(X) = exp(−S²j,I(X) / (2σ²)),    (4)
where Sj,I(X) represents the distance between neuron j and the winning neuron I(X), and the parameter σ is the neighbor radius that controls the neighbor scope. The adaptation process then adjusts the weights of the winner and its topological neighbors according to Equation (5):

Δwji = η(t)·Tj,I(X)·(xi − wji),    (5)
where t represents an epoch, and η(t) represents the learning rate of epoch t. The weight update of each epoch moves the weights of the winning neuron and its neighbors closer to the input tensor, and the process is repeated until convergence is achieved.
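The SOM training loop described above can be sketched as follows. This is a minimal illustration of Equations (3)–(5) on a hypothetical one-dimensional grid of neurons; the epoch count, neighbor radius, and learning-rate schedule are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def train_som(data, n_units, epochs=50, sigma=0.5, lr0=0.5, seed=0):
    """1-D SOM sketch implementing Equations (3)-(5): competition by
    Euclidean distance, Gaussian topological neighborhood, weight adaptation."""
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.random((n_units, data.shape[1]))     # initialization: small random weights
    grid = np.arange(n_units)                          # neuron positions on a 1-D grid
    for t in range(epochs):
        eta = lr0 * (1.0 - t / epochs)                 # decaying learning rate eta(t)
        for x in data:
            d = ((W - x) ** 2).sum(axis=1)             # Eq. (3): distance discriminant
            winner = int(np.argmin(d))                 # competition: index I(X)
            s = np.abs(grid - winner)                  # grid distance S_{j,I(X)}
            h = np.exp(-(s ** 2) / (2 * sigma ** 2))   # Eq. (4): topological neighborhood
            W += eta * h[:, None] * (x - W)            # Eq. (5): adaptation step
    return W

def assign_groups(data, W):
    """Label each spatial point with the index of its nearest SOM unit."""
    d = ((data[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)
```

Applied to SST, each row of `data` would hold one spatial point's time series, and `assign_groups` yields the group label of every spatial point.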
For spatial feature extraction by the clustering neural network, the SOM neural network is first used to cluster the spatial points of the SST data and to divide them into multiple groups. Compared with other points in the sea area, the SST data of spatial points within each group have stronger similarity.
Next, clustering results can be used as the spatial correlation to improve SST prediction, which can be implemented in two ways. Taking LSTM as an example, the first way is to input the clustering results as a new feature into the LSTM model with SST data. The second way is to use LSTM to train each group. The first approach focuses on providing the LSTM model with more spatial information in the spatiotemporal dimensions. The second method aims to reduce the influence of the spatial points with weak correlations, make the LSTM model focus on the points with strong correlations in the spatial dimension, and mine the values and rules of SST data in the temporal dimension so as to improve the SST prediction.
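The second strategy, training one model per cluster, first requires partitioning the spatial points by their cluster labels. A minimal sketch (the label list here is a hypothetical clustering output, not real SST groupings):

```python
from collections import defaultdict

def group_points(labels):
    """Map each cluster label to the list of spatial-point indices it contains,
    so that a separate time-series model can be trained on each group."""
    groups = defaultdict(list)
    for idx, lab in enumerate(labels):
        groups[lab].append(idx)
    return dict(groups)

# Example: six spatial points assigned to two clusters
groups = group_points([0, 0, 1, 0, 1, 1])  # → {0: [0, 1, 3], 1: [2, 4, 5]}
```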
2.4. Graph Convolutional Neural Network
Although spatial feature extraction by the clustering neural network solves the problem of not being able to handle non-adjacent nodes, it still has obvious boundaries between groups, and the connections between nodes in different groups are ignored to a certain extent. Therefore, a new model is needed that not only has the advantages of spatial feature focusing and spatial feature extraction but that can also mine the information between nodes in remote or different groups. It will further improve the effect of spatial correlation and, finally, optimize and improve the accuracy of SST prediction.
The graph convolutional neural network (GCN) treats interconnected points as a group, so it requires no explicit division and does not depend on regular Euclidean space. In other words, it replaces the concept of a group with that of an edge: as long as there is a strong connection between two nodes, an edge can be added between them. Since any two strongly connected nodes share an edge, there is no explicit boundary between groups, which addresses the lack of spatial correlation around group boundaries. Therefore, the GCN is a more complete mechanism for mining spatial correlations.
The graph data structure mainly includes the nodes, the node data, the adjacency matrix, and the degree matrix. A node represents a spatial point in the graph, the node data are the dataset of each node, the adjacency matrix indicates whether there is an edge between any two given nodes, and the degree matrix records the number of edges of each node. Because any two spatial points in the sea are equivalent, the graph discussed in this paper is undirected. Next, we show how the GCN is built on the graph data structure described above.
Given a graph G with node number N and node data X ∈ R^(N×M), where M is the dimension of the node data, the adjacency matrix is A ∈ R^(N×N), and the degree matrix is D ∈ R^(N×N). Because a node of the graph has no edge to itself, the diagonal values of the adjacency matrix A are zero. In the field of neural networks, however, the nodes themselves also play a crucial role, so each node should have an edge to itself. Therefore, in order to apply the graph data structure to the neural network, the identity matrix IN is added to the adjacency matrix A to form a new adjacency matrix Ã, and the identity matrix IN is likewise added to the degree matrix D to form a new degree matrix D̃. The new matrices can be expressed by Equation (6):

Ã = A + IN,  D̃ = D + IN.    (6)
If there is an edge between two nodes, there is a correlation between them. Following the idea of convolution, the information of the nodes correlated with a given node can be merged into that node, so that the neural network has more relevant information to learn. Merging the information of the other nodes into a node is achieved by multiplying Ã by X, i.e., ÃX. Because the information of the associated nodes is summed into each node, the data values of a node with a large degree may become large, while those of a node with a small degree remain relatively small. The neural network is sensitive to such differences in the scale of the input data, which may cause gradient explosion or vanishing gradients.
Thus, the sum operation can be replaced by an average operation. The degree of a node is the number of its associated nodes, so the average is obtained by dividing the sum by the value of the degree. For the entire graph, the degree matrix solves this problem: the average values over the nodes of the entire graph are realized by left multiplying ÃX by the inverse of the degree matrix, D̃⁻¹. This can be represented by Equation (7):

D̃⁻¹ÃX.    (7)
As we can see from Equation (7), D̃⁻¹Ã is a normalization of the rows of the adjacency matrix Ã, dividing each value in row i by the degree d̃i. Since Ã is a symmetric matrix, the same operation should be applied to the columns in order to obtain better results. This can be achieved by also right multiplying Ã by the inverse degree matrix D̃⁻¹, as shown in Equation (8):

D̃⁻¹ÃD̃⁻¹X.    (8)
At this time, another problem emerges: each element ãij of the adjacency matrix Ã is normalized twice; that is, ãij is divided by both d̃i and d̃j, which will certainly affect the predictive effect. To solve this problem, the factor 1/d̃i can be changed to 1/√d̃i (and 1/d̃j to 1/√d̃j). In matrix form, this changes D̃⁻¹ÃD̃⁻¹ to D̃^(−1/2)ÃD̃^(−1/2). After the change, ãij is normalized only once, by √(d̃i·d̃j). So far, for the entire graph, the aggregation of the neighbor information of each node into itself can be represented by Equation (9):

D̃^(−1/2)ÃD̃^(−1/2)X.    (9)
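The normalization steps of Equations (6)–(9) can be checked numerically. The sketch below uses a hypothetical three-node chain graph with one feature per node, purely for illustration:

```python
import numpy as np

# Toy 3-node undirected chain graph; A is the raw adjacency matrix
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
A_tilde = A + np.eye(3)                       # Eq. (6): add self-loops
d_tilde = A_tilde.sum(axis=1)                 # degrees including self-loops
D_inv = np.diag(1.0 / d_tilde)                # inverse degree matrix
D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))  # D̃ to the power -1/2

X = np.array([[1.0], [2.0], [3.0]])           # one feature per node
row_avg = D_inv @ A_tilde @ X                 # Eq. (7): row-wise averaging
twice = D_inv @ A_tilde @ D_inv @ X           # Eq. (8): each entry normalized twice
sym_norm = D_inv_sqrt @ A_tilde @ D_inv_sqrt @ X  # Eq. (9): symmetric normalization
```

Here `row_avg` averages each node with its neighbors, and the symmetrically normalized operator in `sym_norm` remains symmetric, as the derivation requires.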
From the perspective of the graph convolutional neural network, for a hidden layer H^l, the feature transfer between the nodes of layer H^l can be realized through Equation (9): the features of the neighbors of each node are passed to the node itself, realizing the convolution operation. On the basis of Equation (9), adding a trainable weight and an activation function for the nonlinear transformation yields the next hidden layer H^(l+1). Therefore, the propagation rule for the hidden layers of the graph convolutional neural network is expressed as Equation (10):

H^(l+1) = σ(D̃^(−1/2)ÃD̃^(−1/2)H^lW^l),    (10)
where σ is the activation function, and W^l is the trainable weight parameter.
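The propagation rule of Equation (10) can be sketched as a single function. This is a minimal illustration, not the paper's implementation; the tanh activation is an assumed example of σ:

```python
import numpy as np

def gcn_layer(A, H, W, act=np.tanh):
    """One GCN hidden layer following Equation (10):
    H_next = act( D̃^(-1/2) Ã D̃^(-1/2) H W )."""
    A_tilde = A + np.eye(A.shape[0])                 # add self-loops (Eq. (6))
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    # Symmetric normalization D̃^(-1/2) Ã D̃^(-1/2) via broadcasting
    A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    return act(A_hat @ H @ W)
```

Stacking several such layers, each with its own trainable W, gives the hidden layers of the GCN.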
From the above analysis, it can be seen that the graph convolutional neural network can process irregular data shapes, that is, the graph data structure, and integrate the features of connected nodes through the convolution operation to fully explore the rules in the spatial dimension, thus overcoming the limitations of regular boundary division, convolutional sliding translation, and the clustering neural network. Therefore, the graph convolutional neural network is better suited to training and prediction on SST data and is able to improve SST prediction.
2.5. Construction of the Graph Data Structure for SST Data
In order to use the graph convolutional neural network for SST training and prediction, it is necessary to organize the SST data into a graph structure. As described in Section 2.4, the three most important components of the graph structure are the nodes, the node data, and the edges. The data part is the SST time series. Next, the nodes and the edges between them are constructed.
The approach assumes that the selected sea area contains P spatial points and that the time series includes the data of D days. These P points are the nodes of the graph structure, and each node contains its SST time-series data. The data of the entire sea area and the schematic diagram of the nodes with their data are shown in Figure 1. We take one data feature as an example; if there are multiple data features, the data of each node simply changes from one dimension to multiple dimensions.
For the nodes in Figure 1, as long as connections, that is, the edges of the graph, are added between the nodes, the SST data are transformed into a graph data structure. In a graph data structure, the edges are represented by an adjacency matrix, so we need a way to generate the adjacency matrix. The purpose of this paper is to explore the law of spatial correlation and to integrate the information of the connected spatial points during model training and prediction so as to improve the prediction accuracy. For SST data, this means finding a method to identify nodes with strong correlations.
The easiest way is to measure the relationship by the distance between two spatial points. However, distance has limitations because it ignores the time-series data of the spatial points, and the SST data of two spatial points that are relatively far apart may still follow similar rules. Therefore, this paper determines whether there is an edge between two points by defining a threshold and checking whether the correlation coefficient (r) exceeds it. The value of r reflects the correlation of the data between spatial points and can be used to judge the strength of their relationship. The definition of r is shown in Equation (11):

r = Σ_{k=1..D} (xk − x̄)(yk − ȳ) / √( Σ_{k=1..D} (xk − x̄)² · Σ_{k=1..D} (yk − ȳ)² ),    (11)

where x and y are the SST time series of the two spatial points, and x̄ and ȳ are their mean values.
By using r to determine the strength of the relationship between two nodes, the adjacency matrix A of the graph can be expressed as Equation (12):

Aij = 1 if rij > α, and Aij = 0 otherwise,    (12)

where rij is the value of r between the ith and jth nodes, and α is the threshold of r above which there is an edge between two nodes. The range of α is (0, 1); in general, a value close to 1 is taken. The adjacency matrix is calculated by Equation (12). Since the r value between a node and itself is 1 and α is less than 1, the adjacency matrix already contains the identity matrix; it is therefore not necessary to add the identity matrix to A, and the graph convolutional neural network can use the adjacency matrix A directly.
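The construction of the adjacency matrix from Equations (11) and (12) can be sketched in a few lines; the series values below are hypothetical, and α = 0.9 is only an illustrative threshold:

```python
import numpy as np

def build_adjacency(X, alpha=0.9):
    """Adjacency matrix per Equations (11)-(12): an edge exists where the
    Pearson correlation between two nodes' SST series exceeds alpha.
    X has shape (P, D): one row of D daily values per spatial point."""
    r = np.corrcoef(X)                 # Eq. (11): pairwise Pearson r
    return (r > alpha).astype(float)   # Eq. (12); the diagonal (r = 1 > alpha) stays 1
```

Note that the diagonal entries come out as 1 automatically, which is exactly why the identity matrix need not be added separately.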
Edges between spatial points in the sea area are added through the adjacency matrix, so that two points with a strong correlation establish a connection. In this way, the graph structure of the SST data is formed by the nodes and the adjacency matrix, and it can be expressed as Equation (13):

G = (X, A),    (13)

where X is the nodes and data of the graph (X ∈ R^(P×D)), and A is the adjacency matrix of the graph (A ∈ R^(P×P)). P is the number of spatial points, and D is the number of days of the time-series data. The graph structure is used as the input of the graph convolutional neural network to realize the analysis and mining of SST in the spatial dimension and to improve the accuracy of SST prediction.
2.6. The Spatiotemporal Fusion Model for SST Prediction Based on the GCN and the LSTM
It can be seen from Section 2.4 that the graph convolutional neural network (GCN) can fully implement feature extraction in the spatial dimension, but it has no particular advantage in processing time-series data. LSTM is a deep learning model specifically designed for time-series data. Therefore, combining the GCN and the LSTM creates a spatiotemporal fusion model, GCN-LSTM, for SST prediction, which integrates the advantages of neural networks in the spatial and temporal dimensions.
In order to seamlessly integrate the GCN and LSTM models, we first adjust the SST graph data structure G = (X, A) described in Section 2.5. The shape of the nodes and data X is (P, D), and the shape of the adjacency matrix A is (P, P). On this basis, the time step T of the LSTM model is introduced: the shape of X is adjusted to (P, D, T), while the shape of the adjacency matrix A is unchanged. After the time step is added, each time step has a separate graph, and the GCN trains all the graphs corresponding to the time steps. If the size of the prediction window F is taken into account, the number of output graphs generated after training and prediction is determined by F. After adding the time step, the diagram of the GCN is shown in Figure 2.
As we can see from Figure 2, multiple SST graphs are trained through the hidden layers of the GCN and the related nonlinear transformations, and the number of graphs generated as the prediction results is determined by the size of the prediction window F.
After the input data of the GCN are adjusted and the LSTM is added, the model becomes the proposed spatiotemporal fusion model for SST prediction, GCN-LSTM. The GCN-LSTM model is composed of four parts: the first part performs data preprocessing and graph structure construction; the second part uses the GCN to train the SST graphs with time steps to realize spatial feature extraction; the third part feeds the training results of the GCN into the LSTM for further time-series processing; and the last part generates the final prediction results through a fully connected layer. The structure of the GCN-LSTM model is shown in Figure 3.
In Figure 3, P is the number of nodes of the SST graph, i.e., the number of spatial points in the selected sea area, and D is the number of days of the SST time series for each spatial point. The shape of the initial input data of the model is (P, D). First, the adjacency matrix, of shape (P, P), is constructed for the model according to Equation (12) in Section 2.5. Then, the time step T of the LSTM is introduced, and the shape of the input data is adjusted from (P, D) to (P, D, T) to serve as the nodes and node data of the graph structure. After the nodes, data, and adjacency matrix are obtained, the graph data structure of the SST data is constructed by combining them. The shape of each batch of data is (P, B, T), where B is the batch size; this is the input of the GCN. After the convolutional and nonlinear transformations, the shape of the GCN output is the same as that of the input, i.e., still (P, B, T), which completes the feature extraction in the spatial dimension. Next, the graphs trained by the GCN are input into the LSTM for feature analysis and extraction in the temporal dimension. The output shape is (N, B, T), where N is the number of hidden units in the LSTM. In order to generate the final prediction result of the graph, (N, B, T) is flattened and reshaped into two dimensions, (B, T × N). Since the number of nodes in the final graph is P, another fully connected layer is needed to further adjust the output shape to (B, P). At this point, the training and prediction of one batch of data in the temporal dimension is complete. The dataset is divided into 70% training data and 30% test data. Therefore, when all batches have been trained and predicted, the final output of the GCN-LSTM model has the shape (P, 0.3D). It represents the final graph as the prediction result, which contains P nodes, and the size of the SST data of each node is 0.3D in the temporal dimension.
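The shape bookkeeping through the GCN-LSTM pipeline can be traced with a small helper. This is only a sketch of the per-batch shapes walked through above; the example values of P, B, T, and N are illustrative, not the paper's hyperparameters.

```python
def gcn_lstm_shapes(P, B, T, N):
    """Trace the per-batch tensor shapes through the GCN-LSTM pipeline.
    P: spatial points, B: batch size, T: LSTM time steps, N: LSTM hidden units."""
    return [
        ("GCN input",   (P, B, T)),   # graphs with time steps
        ("GCN output",  (P, B, T)),   # the graph convolution keeps the shape
        ("LSTM output", (N, B, T)),   # temporal feature extraction
        ("flattened",   (B, T * N)),  # reshape for the dense layer
        ("final",       (B, P)),      # fully connected layer maps back to P nodes
    ]

# Example: 100 spatial points, batch size 32, 7 time steps, 64 hidden units
for name, shape in gcn_lstm_shapes(100, 32, 7, 64):
    print(name, shape)
```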
The GCN-LSTM model changes the input graph from one to multiple graphs according to the time step of the LSTM. For the graph of each time step, the feature extraction and mining in the spatial dimension are fully conducted through the spatial correlation identified by the GCN, so that the information between nodes with edges can be transferred and integrated with each other. Then, the LSTM is used to further mine the features of the SST data in the temporal dimension. The GCN-LSTM model integrates the advantages of the GCN and LSTM models, and this will significantly improve the accuracy of SST prediction.