Computer algorithms can only operate on numerical data, so text must be converted into numerical form before it can be processed. Text representation is therefore a crucial step in NLP: a good word representation encodes text more effectively and improves classification performance. To train sentiment classification models with neural networks, the text must first be converted into a numerical format, because an appropriate text representation directly affects the accuracy of the resulting model. In this section, we explain our proposed fusion text representation model in more detail and describe each layer used in the proposed architecture.
3.2.1. Embedding Layer
GloVe and Word2Vec are two embedding techniques that transform words into high-dimensional vectors that encapsulate the semantic relationships and contextual nuances between words. GloVe leverages co-occurrence statistics from a corpus to generate word vectors, capturing both local and global statistical information. In contrast, Word2Vec employs a shallow, two-layer neural network to produce embeddings by predicting the context of a word (Skip-gram) or the current word based on its context (Continuous Bag of Words, CBOW). In the context of neural networks, these word embeddings serve as foundational inputs, providing a dense, continuous representation of words that facilitates better performance on various NLP tasks, especially sentiment classification. First, we represent the combination of both word embeddings as

$$E = \left[ E^{GloVe} ; E^{W2V} \right],$$

where $E$ denotes the combined embedding matrix, and $E^{GloVe}$ and $E^{W2V}$ are the individual embeddings generated by the GloVe and Word2Vec models, respectively. This combined representation is utilized as input for the subsequent neural network layers to leverage the strengths of both embedding techniques.
$E^{GloVe}$ is the embedding matrix derived from the GloVe model. The GloVe algorithm constructs word vectors by factoring in the co-occurrence probability of words within a large corpus. Specifically, it builds a co-occurrence matrix $X$, where each entry $X_{ij}$ indicates how often word $i$ appears in the context of word $j$. The model then minimizes a weighted least squares objective function to learn word vectors $w_i$ and context vectors $\tilde{w}_j$ such that their dot product approximates the logarithm of the word–word co-occurrence probabilities:

$$J = \sum_{i,j=1}^{V} f\!\left( X_{ij} \right) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,$$

where $J$ is the objective function to be minimized, $V$ is the vocabulary size (the number of words in the corpus), $w_i$ is the word vector for word $i$, $\tilde{w}_j$ is the context vector for word $j$, $b_i$ and $\tilde{b}_j$ are the bias terms associated with the word and the context, respectively, $X_{ij}$ is the entry of the co-occurrence matrix that indicates how frequently word $i$ appears in the context of word $j$, and $f(X_{ij})$ is the weighting function used to assign smaller weights to word pairs with very high or very low co-occurrence frequencies.
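For concreteness, the following is a minimal NumPy sketch of this objective computed over a small dense co-occurrence matrix; the variable names (`W`, `W_tilde`, `b`, `b_tilde`) and the generic weighting argument `f` are our own illustrative conventions, not code from the GloVe reference implementation.

```python
import numpy as np

def glove_objective(X, W, W_tilde, b, b_tilde, f):
    """Weighted least-squares GloVe objective J for a dense co-occurrence matrix X.

    X         : (V, V) co-occurrence counts
    W         : (V, d) word vectors w_i
    W_tilde   : (V, d) context vectors w~_j
    b, b_tilde: (V,) bias terms
    f         : weighting function applied to each X_ij
    """
    J = 0.0
    V = X.shape[0]
    for i in range(V):
        for j in range(V):
            if X[i, j] > 0:  # log X_ij is only defined for non-zero co-occurrences
                residual = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
                J += f(X[i, j]) * residual ** 2
    return J
```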
In GloVe, $f(x)$ is a weighting function used to adjust the contribution of word pairs based on their co-occurrence frequency $x$. This function plays a crucial role in reducing the influence of very high or very low frequencies, making the model more stable in learning word representations. The function $f(x)$ is generally defined as follows:

$$f(x) = \begin{cases} \left( x / x_{\max} \right)^{\alpha}, & \text{if } x < x_{\max}, \\ 1, & \text{otherwise}, \end{cases}$$

where $x_{\max}$ is the maximum threshold value, beyond which the contribution of the co-occurrence frequency does not increase, and $\alpha$ is a parameter that controls the shape of the weighting function (usually set between 0.5 and 1). This function ensures that word pairs with very low frequencies do not contribute too much, and those with very high frequencies do not dominate the training process.
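A minimal sketch of this weighting function is given below, assuming the commonly cited defaults $x_{\max} = 100$ and $\alpha = 0.75$ from the original GloVe paper; these particular values are illustrative, not a requirement of our model.

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function f(x): damps rare pairs and caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0
```

This function can be passed directly as the `f` argument of the objective sketch above.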
Meanwhile, $E^{W2V}$ represents the embedding matrix obtained from the Word2Vec model. Word2Vec offers two main architectures for generating word embeddings: Skip-gram and Continuous Bag of Words (CBOW). In the Skip-gram model, the algorithm predicts the context words given a target word within a fixed window size. The objective of Skip-gram is to maximize the average log probability

$$J = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \neq 0}} \log p\!\left( w_{t+j} \mid w_t \right),$$

where $J$ is the objective function to be maximized, $T$ is the total number of words in the corpus, $c$ is the window size, representing the number of context words on each side of the target word $w_t$, and $\log p(w_{t+j} \mid w_t)$ is the log probability of the context word $w_{t+j}$ given the target word $w_t$. In the CBOW model, the algorithm predicts the target word based on the context words within the window. The objective is to maximize the probability of the target word given the surrounding context words:

$$J = \frac{1}{T} \sum_{t=1}^{T} \log p\!\left( w_t \mid w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c} \right).$$
In both architectures, the probabilities $p(w_O \mid w_I)$ are defined using the softmax function:

$$p\!\left( w_O \mid w_I \right) = \frac{\exp\!\left( {v'_{w_O}}^{\top} v_{w_I} \right)}{\sum_{w=1}^{V} \exp\!\left( {v'_{w}}^{\top} v_{w_I} \right)},$$

where $v_w$ and $v'_w$ are the input and output vector representations of word $w$, and $V$ is the number of words in the vocabulary. Through this training process, Word2Vec learns word vectors that capture semantic and syntactic relationships based on the contexts in which words appear. By combining $E^{GloVe}$ and $E^{W2V}$ into the embedding matrix $E$, we can harness the complementary strengths of both embedding techniques, leading to richer and more informative word representations for downstream neural network models.
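The softmax above can be written directly in NumPy; the matrices `V_in` and `V_out` below are hypothetical input and output embedding matrices used only to illustrate the computation.

```python
import numpy as np

def word_probability(o, i, V_in, V_out):
    """p(w_o | w_i) under the Word2Vec softmax.

    V_in  : (V, d) input vectors v_w
    V_out : (V, d) output vectors v'_w
    o, i  : indices of the output (context/target) word and the input word
    """
    scores = V_out @ V_in[i]                   # v'_w . v_{w_i} for every word w
    scores -= scores.max()                     # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]
```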
In our proposed model, let $V$ be the vocabulary size and $d$ be the embedding dimension; then, the embedding matrix for GloVe can be represented as $E^{GloVe} \in \mathbb{R}^{V \times d}$, where each word $w_i$ is represented by the embedding vector $e_i^{GloVe} \in \mathbb{R}^{d}$. Meanwhile, Word2Vec is a neural network-based embedding method that learns word representations by maximizing the probability of neighboring words in a given context. The embedding matrix for Word2Vec can be represented as $E^{W2V} \in \mathbb{R}^{V \times d}$, where each word $w_i$ is represented by the embedding vector $e_i^{W2V} \in \mathbb{R}^{d}$. Once both embedding matrices are ready, the next step is to build a fusion model that combines them. Suppose we have an input text that has been processed into a sequence of words $X = (x_1, x_2, \ldots, x_L)$, where $L$ is the maximum length of the sequence. The representation produced by each embedding is given below:
$$E^{GloVe}(X) = \left( e_1^{GloVe}, e_2^{GloVe}, \ldots, e_L^{GloVe} \right) \quad \text{with } e_t^{GloVe} \in \mathbb{R}^{d},$$

$$E^{W2V}(X) = \left( e_1^{W2V}, e_2^{W2V}, \ldots, e_L^{W2V} \right) \quad \text{with } e_t^{W2V} \in \mathbb{R}^{d}.$$

Once both embedding sequences are ready, the next step is to build the embedding fusion, which is the combination of $E^{GloVe}$ and $E^{W2V}$:

$$e_t = \left[ e_t^{GloVe} ; e_t^{W2V} \right] \quad \text{with } e_t \in \mathbb{R}^{2d},$$

where the concatenation is performed along the embedding dimension so that each word is represented by an embedding vector of dimension $2d$. Mathematically, the embedding layer transforms the sequence of words $X = (x_1, x_2, \ldots, x_L)$ into a sequence of embedding vectors:

$$E(X) = \left( e_1, e_2, \ldots, e_L \right),$$

where $e_t = \left[ e_t^{GloVe} ; e_t^{W2V} \right] \in \mathbb{R}^{2d}$ is the concatenated embedding for the word $x_t$, and $[\cdot \, ; \cdot]$ denotes the concatenation operator.
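A minimal sketch of this fusion step is shown below, assuming two aligned pretrained matrices `E_glove` and `E_w2v` (both of shape V x d, with rows in the same vocabulary order) are already available; the names, shapes, and random stand-in values are purely illustrative.

```python
import numpy as np

V, d = 20000, 100                      # assumed vocabulary size and embedding dimension
E_glove = np.random.rand(V, d)         # stand-in for the pretrained GloVe matrix
E_w2v   = np.random.rand(V, d)         # stand-in for the pretrained Word2Vec matrix

# Fuse along the embedding dimension: each word becomes a 2d-dimensional vector.
E_fused = np.concatenate([E_glove, E_w2v], axis=1)    # shape (V, 2d)

# Embedding lookup for a padded word-index sequence x_1, ..., x_L.
token_ids = np.array([4, 17, 256, 0, 0])              # example sequence of length L = 5
E_X = E_fused[token_ids]                              # shape (L, 2d)
```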
3.2.2. biGRU Layer
A Gated Recurrent Unit (GRU) is a type of recurrent neural network (RNN) designed to efficiently capture dependencies in sequential data while mitigating the vanishing gradient problem that traditional RNNs often face. The GRU employs two gates, a reset gate and an update gate, to regulate the flow of information and maintain long-term dependencies. The hidden state $h_t$ of the GRU at time $t$ is computed from the current input $x_t$ and the previous hidden state $h_{t-1}$ and can be mathematically represented as follows:

$$
\begin{aligned}
z_t &= \sigma\!\left( W_z x_t + U_z h_{t-1} + b_z \right), \\
r_t &= \sigma\!\left( W_r x_t + U_r h_{t-1} + b_r \right), \\
\tilde{h}_t &= \tanh\!\left( W_h x_t + U_h \left( r_t \odot h_{t-1} \right) + b_h \right), \\
h_t &= \left( 1 - z_t \right) \odot h_{t-1} + z_t \odot \tilde{h}_t,
\end{aligned}
$$

where $z_t$ and $r_t$ are the update and reset gates, $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, and $W$, $U$, and $b$ are the learnable weight matrices and bias vectors.
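The update equations above can be sketched as a single NumPy time step as follows; the parameter dictionary `p` and the weight shapes are our own illustrative conventions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU time step: returns h_t from input x_t and previous hidden state h_prev.

    p is a dict of parameters: W_* (hidden, input), U_* (hidden, hidden), b_* (hidden,).
    """
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])               # update gate
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])               # reset gate
    h_cand = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * h_prev) + p["b_h"])    # candidate state
    return (1.0 - z) * h_prev + z * h_cand                                   # interpolated new state
```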
Building on the GRU’s ability to capture dependencies in sequential data, the bidirectional GRU (biGRU) enhances this capability by processing the data in both forward and backward directions. This bidirectional approach allows the model to harness contextual information from both past and future sequences, providing a richer understanding of the data.
In our model, after the embedding layer, the sequence of embeddings $E(X) = (e_1, e_2, \ldots, e_L)$ is passed through a biGRU layer. The biGRU consists of two GRU layers (forward and backward) that process the data from start to end and from end to start, respectively. Suppose $\overrightarrow{h}_t$ is the hidden state of the forward GRU at time $t$; then, the forward GRU is defined as follows:

$$\overrightarrow{h}_t = \mathrm{GRU}\!\left( e_t, \overrightarrow{h}_{t-1} \right).$$

Meanwhile, if $\overleftarrow{h}_t$ is the hidden state of the backward GRU at time $t$, then the backward GRU is defined as follows:

$$\overleftarrow{h}_t = \mathrm{GRU}\!\left( e_t, \overleftarrow{h}_{t+1} \right).$$

Therefore, the output of the biGRU at time $t$ is the combination of the forward and backward hidden states:

$$h_t = \left[ \overrightarrow{h}_t ; \overleftarrow{h}_t \right].$$
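As a sketch of how the fused embedding and biGRU layers described in this section could be assembled in Keras (the fused matrix `E_fused`, the 64 GRU units, and the sequence length `L` are assumptions made only for illustration, not the exact hyperparameters of our model):

```python
import numpy as np
import tensorflow as tf

V, d, L = 20000, 100, 50                 # assumed vocabulary size, embedding dim, sequence length
E_fused = np.random.rand(V, 2 * d)       # stand-in for the concatenated GloVe + Word2Vec matrix

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(L,), dtype="int32"),
    # Embedding layer initialized with the fused (frozen) embedding matrix.
    tf.keras.layers.Embedding(
        input_dim=V,
        output_dim=2 * d,
        embeddings_initializer=tf.keras.initializers.Constant(E_fused),
        trainable=False,
    ),
    # Bidirectional GRU: concatenates forward and backward hidden states at each time step.
    tf.keras.layers.Bidirectional(
        tf.keras.layers.GRU(64, return_sequences=True), merge_mode="concat"
    ),
])
```

The `merge_mode="concat"` setting corresponds to the concatenation of the forward and backward hidden states given in the equation above.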