Computer algorithms can only operate on numerical data, so text must be converted into numerical form before it can be processed. Text representation is therefore a crucial step in NLP: a good word representation encodes text more effectively and improves classification performance. To train sentiment classification models with neural networks, the text must first be converted into a numerical format, because an appropriate text representation directly affects the accuracy of the resulting model. In this section, we explain our proposed fusion text representation model in more detail and describe each layer used in the proposed architecture.
3.2.1. Embedding Layer
GloVe and Word2Vec are two embedding techniques that transform words into high-dimensional vectors that encapsulate the semantic relationships and contextual nuances between words. GloVe leverages co-occurrence statistics from a corpus to generate word vectors, capturing both local and global statistical information. In contrast, Word2Vec employs a shallow, two-layer neural network to produce embeddings by predicting the context of a word (Skip-gram) or the current word based on its context (Continuous Bag of Words, CBOW). In the context of neural networks, these word embeddings serve as foundational inputs, providing a dense, continuous representation of words that facilitates better performance on various NLP tasks, especially sentiment classification. First, we represent the combination of both word embeddings as

$$E = \left[ E^{GloVe} ; E^{W2V} \right],$$

where $E$ denotes the combined embedding matrix, and $E^{GloVe}$ and $E^{W2V}$ are the individual embeddings generated by the GloVe and Word2Vec models, respectively. This combined representation is utilized as input for the subsequent neural network layers to leverage the strengths of both embedding techniques.
$E^{GloVe}$ is the embedding matrix derived from the GloVe model. The GloVe algorithm constructs word vectors by factoring in the co-occurrence probability of words within a large corpus. Specifically, it builds a co-occurrence matrix $X$, where each entry $X_{ij}$ indicates how often word $i$ appears in the context of word $j$. The model then minimizes a weighted least squares objective function to learn word vectors $w_i$ and context vectors $\tilde{w}_j$ such that their dot product approximates the logarithm of the word–word co-occurrence probabilities:

$$J = \sum_{i,j=1}^{V} f\!\left( X_{ij} \right) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,$$

where $J$ is the objective function to be minimized, $V$ is the vocabulary size (the number of words in the corpus), $w_i$ is the word vector for word $i$, $\tilde{w}_j$ is the context vector for word $j$, $b_i$ and $\tilde{b}_j$ are the bias terms associated with the word and the context, respectively, $X_{ij}$ is the entry of the co-occurrence matrix that indicates how frequently word $i$ appears in the context of word $j$, and $f(X_{ij})$ is the weighting function used to assign smaller weights to word pairs with very high or very low co-occurrence frequencies.
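For concreteness, the following is a minimal NumPy sketch of this objective computed over a small dense co-occurrence matrix; the variable names (`W`, `W_tilde`, `b`, `b_tilde`) and the generic weighting argument `f` are our own illustrative conventions, not code from the GloVe reference implementation.

```python
import numpy as np

def glove_objective(X, W, W_tilde, b, b_tilde, f):
    """Weighted least-squares GloVe objective J for a dense co-occurrence matrix X.

    X         : (V, V) co-occurrence counts
    W         : (V, d) word vectors w_i
    W_tilde   : (V, d) context vectors w~_j
    b, b_tilde: (V,) bias terms
    f         : weighting function applied to each X_ij
    """
    J = 0.0
    V = X.shape[0]
    for i in range(V):
        for j in range(V):
            if X[i, j] > 0:  # log X_ij is only defined for non-zero co-occurrences
                residual = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
                J += f(X[i, j]) * residual ** 2
    return J
```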
In GloVe, $f(x)$ is a weighting function used to adjust the contribution of word pairs based on their co-occurrence frequency $x$. This function plays a crucial role in reducing the influence of very high or very low frequencies, making the model more stable in learning word representations. The function $f(x)$ is generally defined as follows:

$$f(x) = \begin{cases} \left( x / x_{\max} \right)^{\alpha}, & \text{if } x < x_{\max}, \\ 1, & \text{otherwise}, \end{cases}$$

where $x_{\max}$ is the maximum threshold value, beyond which the contribution of the co-occurrence frequency does not increase, and $\alpha$ is a parameter that controls the shape of the weighting function (usually set between 0.5 and 1). This function ensures that word pairs with very low frequencies do not contribute too much, and those with very high frequencies do not dominate the training process.
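A minimal sketch of this weighting function is given below, assuming the commonly cited defaults $x_{\max} = 100$ and $\alpha = 0.75$ from the original GloVe paper; these particular values are illustrative, not a requirement of our model.

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function f(x): damps rare pairs and caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0
```

This function can be passed directly as the `f` argument of the objective sketch above.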
Meanwhile, $E^{W2V}$ represents the embedding matrix obtained from the Word2Vec model. Word2Vec offers two main architectures for generating word embeddings: Skip-gram and Continuous Bag of Words (CBOW). In the Skip-gram model, the algorithm predicts the context words given a target word within a fixed window size. The objective of Skip-gram is to maximize the average log probability

$$J = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \neq 0}} \log p\!\left( w_{t+j} \mid w_t \right),$$

where $J$ is the objective function to be maximized, $T$ is the total number of words in the corpus, $c$ is the window size, representing the number of context words on each side of the target word $w_t$, and $\log p(w_{t+j} \mid w_t)$ is the log probability of the context word $w_{t+j}$ given the target word $w_t$. In the CBOW model, the algorithm predicts the target word based on the context words within the window. The objective is to maximize the probability of the target word given the surrounding context words:

$$J = \frac{1}{T} \sum_{t=1}^{T} \log p\!\left( w_t \mid w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c} \right).$$
In both architectures, the probabilities $p(w_O \mid w_I)$ are defined using the softmax function:

$$p\!\left( w_O \mid w_I \right) = \frac{\exp\!\left( {v'_{w_O}}^{\top} v_{w_I} \right)}{\sum_{w=1}^{V} \exp\!\left( {v'_{w}}^{\top} v_{w_I} \right)},$$

where $v_w$ and $v'_w$ are the input and output vector representations of word $w$, and $V$ is the number of words in the vocabulary. Through this training process, Word2Vec learns word vectors that capture semantic and syntactic relationships based on the contexts in which words appear. By combining $E^{GloVe}$ and $E^{W2V}$ into the embedding matrix $E$, we can harness the complementary strengths of both embedding techniques, leading to richer and more informative word representations for downstream neural network models.
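The softmax above can be written directly in NumPy; the matrices `V_in` and `V_out` below are hypothetical input and output embedding matrices used only to illustrate the computation.

```python
import numpy as np

def word_probability(o, i, V_in, V_out):
    """p(w_o | w_i) under the Word2Vec softmax.

    V_in  : (V, d) input vectors v_w
    V_out : (V, d) output vectors v'_w
    o, i  : indices of the output (context/target) word and the input word
    """
    scores = V_out @ V_in[i]                   # v'_w . v_{w_i} for every word w
    scores -= scores.max()                     # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]
```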
In our proposed model, let $V$ be the vocabulary size and $d$ be the embedding dimension; then, the embedding matrix for GloVe can be represented as $E^{GloVe} \in \mathbb{R}^{V \times d}$, where each word $w_i$ is represented by the embedding vector $e_i^{GloVe} \in \mathbb{R}^{d}$. Meanwhile, Word2Vec is a neural network-based embedding method that learns word representations by maximizing the probability of neighboring words in a given context. The embedding matrix for Word2Vec can be represented as $E^{W2V} \in \mathbb{R}^{V \times d}$, where each word $w_i$ is represented by the embedding vector $e_i^{W2V} \in \mathbb{R}^{d}$. Once both embedding matrices are ready, the next step is to build a fusion model that combines them. Suppose we have an input text that has been processed into a sequence of words $X = (x_1, x_2, \ldots, x_L)$, where $L$ is the maximum length of the sequence. The representation produced by each embedding is given below:
$$E^{GloVe}(X) = \left( e_1^{GloVe}, e_2^{GloVe}, \ldots, e_L^{GloVe} \right) \quad \text{with } e_t^{GloVe} \in \mathbb{R}^{d},$$

$$E^{W2V}(X) = \left( e_1^{W2V}, e_2^{W2V}, \ldots, e_L^{W2V} \right) \quad \text{with } e_t^{W2V} \in \mathbb{R}^{d}.$$

Once both embedding sequences are ready, the next step is to build the embedding fusion, which is the combination of $E^{GloVe}$ and $E^{W2V}$:

$$e_t = \left[ e_t^{GloVe} ; e_t^{W2V} \right] \quad \text{with } e_t \in \mathbb{R}^{2d},$$

where the concatenation is performed along the embedding dimension so that each word is represented by an embedding vector of dimension $2d$. Mathematically, the embedding layer transforms the sequence of words $X = (x_1, x_2, \ldots, x_L)$ into a sequence of embedding vectors:

$$E(X) = \left( e_1, e_2, \ldots, e_L \right),$$

where $e_t = \left[ e_t^{GloVe} ; e_t^{W2V} \right] \in \mathbb{R}^{2d}$ is the concatenated embedding for the word $x_t$, and $[\cdot \, ; \cdot]$ denotes the concatenation operator.
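A minimal sketch of this fusion step is shown below, assuming two aligned pretrained matrices `E_glove` and `E_w2v` (both of shape V x d, with rows in the same vocabulary order) are already available; the names, shapes, and random stand-in values are purely illustrative.

```python
import numpy as np

V, d = 20000, 100                      # assumed vocabulary size and embedding dimension
E_glove = np.random.rand(V, d)         # stand-in for the pretrained GloVe matrix
E_w2v   = np.random.rand(V, d)         # stand-in for the pretrained Word2Vec matrix

# Fuse along the embedding dimension: each word becomes a 2d-dimensional vector.
E_fused = np.concatenate([E_glove, E_w2v], axis=1)    # shape (V, 2d)

# Embedding lookup for a padded word-index sequence x_1, ..., x_L.
token_ids = np.array([4, 17, 256, 0, 0])              # example sequence of length L = 5
E_X = E_fused[token_ids]                              # shape (L, 2d)
```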
3.2.2. biGRU Layer
A Gated Recurrent Unit (GRU) is a type of recurrent neural network (RNN) designed to efficiently capture dependencies in sequential data while mitigating the vanishing gradient problem that traditional RNNs often face. The GRU employs two gates, a reset gate and an update gate, to regulate the flow of information and maintain long-term dependencies. The hidden state $h_t$ of the GRU at time $t$ is computed from the current input $x_t$ and the previous hidden state $h_{t-1}$ and can be mathematically represented as follows:

$$
\begin{aligned}
z_t &= \sigma\!\left( W_z x_t + U_z h_{t-1} + b_z \right), \\
r_t &= \sigma\!\left( W_r x_t + U_r h_{t-1} + b_r \right), \\
\tilde{h}_t &= \tanh\!\left( W_h x_t + U_h \left( r_t \odot h_{t-1} \right) + b_h \right), \\
h_t &= \left( 1 - z_t \right) \odot h_{t-1} + z_t \odot \tilde{h}_t,
\end{aligned}
$$

where $z_t$ and $r_t$ are the update and reset gates, $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, and $W$, $U$, and $b$ are the learnable weight matrices and bias vectors.
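The update equations above can be sketched as a single NumPy time step as follows; the parameter dictionary `p` and the weight shapes are our own illustrative conventions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU time step: returns h_t from input x_t and previous hidden state h_prev.

    p is a dict of parameters: W_* (hidden, input), U_* (hidden, hidden), b_* (hidden,).
    """
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])               # update gate
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])               # reset gate
    h_cand = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * h_prev) + p["b_h"])    # candidate state
    return (1.0 - z) * h_prev + z * h_cand                                   # interpolated new state
```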
Building on the GRU’s ability to capture dependencies in sequential data, the bidirectional GRU (biGRU) enhances this capability by processing the data in both forward and backward directions. This bidirectional approach allows the model to harness contextual information from both past and future sequences, providing a richer understanding of the data.
In our model, after the embedding layer, the sequence of embeddings $E(X) = (e_1, e_2, \ldots, e_L)$ is passed through a biGRU layer. The biGRU consists of two GRU layers (forward and backward) that process the data from start to end and from end to start, respectively. Suppose $\overrightarrow{h}_t$ is the hidden state of the forward GRU at time $t$; then, the forward GRU is defined as follows:

$$\overrightarrow{h}_t = \mathrm{GRU}\!\left( e_t, \overrightarrow{h}_{t-1} \right).$$

Meanwhile, if $\overleftarrow{h}_t$ is the hidden state of the backward GRU at time $t$, then the backward GRU is defined as follows:

$$\overleftarrow{h}_t = \mathrm{GRU}\!\left( e_t, \overleftarrow{h}_{t+1} \right).$$

Therefore, the output of the biGRU at time $t$ is the combination of the forward and backward hidden states:

$$h_t = \left[ \overrightarrow{h}_t ; \overleftarrow{h}_t \right].$$
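As a sketch of how the fused embedding and biGRU layers described in this section could be assembled in Keras (the fused matrix `E_fused`, the 64 GRU units, and the sequence length `L` are assumptions made only for illustration, not the exact hyperparameters of our model):

```python
import numpy as np
import tensorflow as tf

V, d, L = 20000, 100, 50                 # assumed vocabulary size, embedding dim, sequence length
E_fused = np.random.rand(V, 2 * d)       # stand-in for the concatenated GloVe + Word2Vec matrix

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(L,), dtype="int32"),
    # Embedding layer initialized with the fused (frozen) embedding matrix.
    tf.keras.layers.Embedding(
        input_dim=V,
        output_dim=2 * d,
        embeddings_initializer=tf.keras.initializers.Constant(E_fused),
        trainable=False,
    ),
    # Bidirectional GRU: concatenates forward and backward hidden states at each time step.
    tf.keras.layers.Bidirectional(
        tf.keras.layers.GRU(64, return_sequences=True), merge_mode="concat"
    ),
])
```

The `merge_mode="concat"` setting corresponds to the concatenation of the forward and backward hidden states given in the equation above.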