3.2. Vector Embedding Layer
The word vector input in the above studies was relatively simple; this paper therefore fuses word information and word position information as the input of the embedding layer. If the length of a sentence is n and the fixed input length is L, then when n is greater than L, only the first L words are kept, and when n is less than L, the sentence is completed by padding zeros at the end. The proposed model maps each word in a sentence into a vector x_i; the sentence sequence can then be represented as X = {x_1, x_2, …, x_L}, where each vector is the sum of two embeddings: the word vector encoding and the position vector encoding.
(1) Tok, in Figure 2, stands for the word vector encoding. Each word is vectorized to obtain the representation x_i, and each comment sentence after vectorization is shown in (1). Here, d represents the dimension of the word vector, n represents the number of words in each review, and x_i represents the i-th word vector in the text X.
(2) The second part, shown in Figure 2, is the current position encoding vector, whose dimension is consistent with the Tok word vector dimension. By adding Tok and the position encoding vector together, each word carries its word information together with its position information. The position vector is obtained from Formulas (2) and (3), where pos refers to the position of the word in the sentence, i is the position parameter, the even positions are calculated using Formula (2), the odd positions are calculated using Formula (3), and d refers to the word vector dimension.
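For illustration, the following is a minimal sketch of such an embedding layer, assuming the position vectors follow the standard sinusoidal sine/cosine encoding; the class and parameter names (EmbeddingWithPosition, d_model, max_len) are illustrative rather than taken from the paper.

```python
import math
import torch
import torch.nn as nn

class EmbeddingWithPosition(nn.Module):
    """Word embedding plus position encoding (a sketch; the paper's Formulas (2)-(3)
    are assumed here to be the standard sin/cos scheme)."""
    def __init__(self, vocab_size, d_model, max_len):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model, padding_idx=0)  # zero index = padding
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)   # even dimensions, cf. Formula (2)
        pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions, cf. Formula (3)
        self.register_buffer("pe", pe)

    def forward(self, token_ids):
        # token_ids: (batch, L), already truncated or zero-padded to length L
        x = self.tok(token_ids)               # (batch, L, d_model) word vectors
        return x + self.pe[: token_ids.size(1)]  # add position information to each word vector
```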
3.3. Multi-Channel BLTCN-BLSTM Self-Attention Network Model
After the text has been vectorized, the representation is fed into channel one and channel two simultaneously. The multi-channel BLTCN-BLSTM self-attention network model captures multi-level and multi-dimensional semantic information: the TextCNN channel extracts feature representations at different levels through three convolution kernels of different sizes, while the Bi-LSTM bidirectional channel mainly captures the sequential dependencies of the text.
As shown in Figure 2, the multi-channel BLTCN-BLSTM self-attention model is composed of bidirectional dynamic encoding, convolution, a cross-channel feature fusion layer, and a self-attention layer.
(1) TextCNN channel: The TextCNN model applies the CNN architecture to text. Its structure includes an input layer, multiple convolution layers, a pooling layer, a fully connected layer, and an output layer. When processing text, because of the special form of the input, the convolution is generally one-dimensional, and the width of the convolution kernel is consistent with the dimension of the word vector, so N-gram local features are extracted by the convolution operation. The TextCNN channel takes the vector embedding layer output as the input of the convolution channel, and the dimension of the word vector is d. The convolution layer uses sliding windows of three different sizes to perform convolution operations on the text input vector to learn text features; the convolution kernel sizes are 2, 3, and 4, respectively. Each filter_size corresponds to a different channel, which is processed to obtain a regional feature map. After the dimensionality of each feature map is reduced by Max pooling in the pooling layer, the channel vectors are merged into a whole through a concatenation operation, which is used as the output of the TextCNN channel. The feature value c_i is obtained by applying the convolution kernel at position i, as given in Formula (4).
Here, d represents the vector dimension corresponding to each word in the text sequence, w represents the convolution kernel of dimension h × d, and x_{i:i+h-1} represents the sliding window consisting of row i to row i+h-1 of the input matrix; b denotes the bias term parameter, and f denotes the nonlinear mapping function. After convolution over each window, the Max pooling method [35] is used to reduce the feature dimension through the vector–matrix pooling operation, as shown in Figure A1. The calculation formula is c_max = max{c_1, c_2, …, c_{n-h+1}}, where c = [c_1, c_2, …, c_{n-h+1}] represents the feature vector obtained by the convolution operation; finally, the pooled vectors are spliced together.
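A minimal sketch of this channel follows, assuming ReLU as the nonlinear mapping f and 100 filters per kernel size (both are illustrative choices, not fixed by the paper here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNNChannel(nn.Module):
    """TextCNN channel sketch: 1-D convolutions with kernel sizes 2, 3, and 4
    over the embedded sequence, max pooling over time, then concatenation."""
    def __init__(self, d_model, num_filters=100, kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(d_model, num_filters, k) for k in kernel_sizes]
        )

    def forward(self, x):                   # x: (batch, L, d_model)
        x = x.transpose(1, 2)               # (batch, d_model, L) for Conv1d
        feats = []
        for conv in self.convs:
            c = F.relu(conv(x))             # cf. Formula (4): f(w * x_{i:i+h-1} + b)
            c = F.max_pool1d(c, c.size(2))  # max pooling over time, one value per filter
            feats.append(c.squeeze(2))      # (batch, num_filters)
        return torch.cat(feats, dim=1)      # splice the channel vectors into one output
```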
(2) Bi-LSTM bidirectional channel: Although the multi-channel TextCNN can effectively extract local features, it cannot at the same time account for the temporal features of a sentence. To make up for this defect, the LSTM (long short-term memory) network model is introduced. LSTM can process the temporal information of sentences: by integrating the input information of the history unit and the current time unit, it produces a "memory" cell that reflects the global information. However, experiments have shown that LSTM still cannot transmit information effectively over long sequences, mainly because the semantic encoding produced by this network is biased toward the semantics of the last few words in the text [36]. This problem can be effectively solved by designing a bidirectional memory network. The structure of Bi-LSTM is shown in Figure A2; a forward LSTM and a backward LSTM are trained jointly, capturing forward and backward information, respectively, as shown in Formulas (5)–(11).
Here, f_t, i_t, and o_t denote the forget gate, input gate, and output gate of the forward pass at time t, respectively. σ is the sigmoid activation function, and C̃_t is the input gate candidate cell. C_t represents the output of the forward memory control unit after updating at time t. W and U are the weight matrices of the forward pass, and b denotes the bias vector of the forward pass. The backward pass is defined by the same formulas as the forward pass. x_t represents the input vector; the forward pass is learned from x_1 to x_t at time t, the backward pass is learned from x_n to x_t at time t, and the forward and backward hidden states are concatenated to obtain the final hidden layer representation h_t.
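A minimal sketch of this channel using the built-in bidirectional LSTM of PyTorch (the hidden size of 128 is an illustrative choice):

```python
import torch.nn as nn

class BiLSTMChannel(nn.Module):
    """Bi-LSTM channel sketch: a forward and a backward LSTM whose hidden
    states are concatenated at every time step, cf. Formulas (5)-(11)."""
    def __init__(self, d_model, hidden_size=128):
        super().__init__()
        self.bilstm = nn.LSTM(d_model, hidden_size,
                              batch_first=True, bidirectional=True)

    def forward(self, x):          # x: (batch, L, d_model)
        h, _ = self.bilstm(x)      # h: (batch, L, 2 * hidden_size)
        return h                   # forward and backward states concatenated per time step
```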
(3) Self-attention layer: Self-attention is a variant of attention, also called internal attention. Its advantage is that it can directly calculate the dependencies between vectors without additional external information, which allows it to better learn the internal structure of sentences. The vector X output from the TextCNN channel and the Bi-LSTM bidirectional channel is input into the self-attention layer to learn the dependencies within the sequence and the weights of the different word vectors, where X is the input information of the self-attention layer; the scaled dot-product operation is then applied, d_k is the dimension of the embedded word vector, and the calculation formula is as shown in (12).
The structure of the self-attention mechanism is shown in Figure 3. The self-attention mechanism usually adopts the Query-Key-Value (QKV) form, where Q, K, and V come from the same input. Let the input matrix be X; the matrices Q, K, and V are obtained from X by different matrix transformations. First, the similarity is calculated by multiplying the Q matrix by the transpose of the K matrix, and the resulting matrix is passed through softmax for normalization. Finally, the normalized matrix is multiplied by the V matrix to obtain the output of the self-attention layer.
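A minimal sketch of this scaled dot-product self-attention, with the three linear maps that produce Q, K, and V from the same input (the dimension names d_in and d_k are illustrative):

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Scaled dot-product self-attention sketch, cf. Formula (12):
    Q, K, and V come from the same input X via separate linear maps."""
    def __init__(self, d_in, d_k):
        super().__init__()
        self.wq = nn.Linear(d_in, d_k)
        self.wk = nn.Linear(d_in, d_k)
        self.wv = nn.Linear(d_in, d_k)

    def forward(self, x):                                        # x: (batch, L, d_in)
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        scores = q @ k.transpose(1, 2) / math.sqrt(q.size(-1))   # QK^T / sqrt(d_k): similarity
        weights = torch.softmax(scores, dim=-1)                  # softmax normalization
        return weights @ v                                       # weighted sum over V
```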
(4) Connected layer: An information fusion operation is applied to the high-level features produced by the multiple channels to obtain the fused text feature representation; the calculation formula is given in (13). In it, the text feature vector output by the Bi-LSTM bidirectional channel after attention and the text feature vector output by the TextCNN channel after attention are fused, and the fused text feature vector is then classified by the L-Softmax classifier.
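A minimal sketch of this fusion step, assuming the fusion in (13) is a simple concatenation of the two attention-weighted channel outputs (the operator itself is not shown in this excerpt, so concatenation is an assumption):

```python
import torch

def fuse_channels(bilstm_att_out, textcnn_att_out):
    """Cross-channel feature fusion sketch for Formula (13): concatenate the
    attention-weighted Bi-LSTM and TextCNN features into one text representation.
    Concatenation is an assumed choice; the fused vector is then fed to L-Softmax."""
    return torch.cat([bilstm_att_out, textcnn_att_out], dim=-1)
```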
By constructing the multi-channel BLTCN-BLSTM network, text features are extracted in combination with semantic knowledge, which handles local feature extraction while effectively preserving long-distance dependencies and the correlations between attributes. The learned feature vectors are then assigned different weights by the self-attention layer to strengthen the learning of emotional semantic features.
3.5. Loss Rebalancing
After the L-Softmax layer, Focal Loss is further incorporated. By reducing the contribution of easy text samples to the loss and increasing the contribution of difficult samples, the optimal parameters are found, which alleviates the model performance problems caused by the overall class imbalance of the data. The calculation formulas are given in (15) and (16).
where p_t represents the probability that the sample is predicted to be its true class, y represents the class label of the sample, the weighting factor α balances the effect of the difference between the numbers of positive and negative samples on the total loss, (1 − p_t)^γ is the modulating factor, and the focusing parameter γ reduces the loss of easily classified samples. In the Focal Loss algorithm, when a sample is correctly classified, a larger p_t means higher classification confidence, indicating that the sample is easy to classify, while a smaller p_t means lower classification confidence, indicating that the sample is difficult to distinguish. The experiments found that the best results were obtained with a particular setting of α and γ: when p_t is close to 1, the modulating factor approaches 0, so the loss contribution of such an easily distinguished sample decreases; when p_t is small, the modulating factor approaches 1, so the loss contribution of such a hard-to-distinguish sample increases.
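A minimal sketch of a multi-class focal loss in this spirit; the defaults α = 0.25 and γ = 2 below are the common values from the original Focal Loss paper, not necessarily the values tuned in this work:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Focal loss sketch, cf. Formulas (15)-(16): down-weight easy samples
    (p_t close to 1) and emphasize hard samples (p_t small)."""
    log_p = F.log_softmax(logits, dim=-1)                        # log-probabilities per class
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)    # log p_t of the true class
    pt = log_pt.exp()                                            # p_t
    loss = -alpha * (1.0 - pt) ** gamma * log_pt                 # modulating factor scales the CE term
    return loss.mean()
```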