Abstract
Sentiment analysis in natural language processing (NLP) is used to understand the polarity of human emotions (e.g., positive and negative) and preferences (e.g., price and quality). Reinforcement learning (RL) enables a decision maker (or agent) to observe the operating environment (or the current state) and select the optimal action to receive feedback signals (or reward) from the operating environment. Deep reinforcement learning (DRL) extends RL with deep neural networks (i.e., main and target networks) to capture the state information of inputs and address the curse of dimensionality issue of RL. In sentiment analysis, RL and DRL reduce the need for a large labeled dataset and linguistic resources, increasing scalability and preserving the context and order of logical partitions. Through enhancement, the RL and DRL algorithms identify negations, enhance the quality of the generated responses, predict the logical partitions, remove the irrelevant aspects, and ultimately capture the correct sentiment polarity. This paper presents a review of RL and DRL models and algorithms with their objectives, applications, datasets, performance, and open issues in sentiment analysis.
1 Introduction
What is sentiment analysis? Sentiment analysis in natural language processing (NLP) is a technique that aims to understand human sentiments (or emotions) and preferences. Human sentiments are generally grouped into categories (e.g., “positive”, “neutral”, and “negative”) to understand human behaviors and preferences. Sentiment analysis has been widely used in various applications, including digital marketing, to help businesses (or sellers) understand customer sentiments and preferences for different products and services. The extracted sentiments are part of the decision-making factors for businesses to take further actions, such as contacting prospects personally and offering new products, to improve customer intention to purchase, contributing to improved sales and customer relationships. However, sentiment analysis can be challenging because it: (a) requires large labeled datasets; and (b) is prone to misinterpreting some sentences, such as those containing sarcasm or negations.
What is reinforcement learning? Reinforcement learning (RL) enables a decision maker (or an agent) to learn through interactions with the operating environment. The agent observes the operating environment (the current state) and selects the optimal action at the current decision epoch t, and then receives feedback signals (reward) from the operating environment at the next decision epoch. The agent repeats these steps and learns optimal actions for different states over time, leading to an improved cumulative reward. Deep reinforcement learning (DRL) extends RL with deep neural networks, which use layers of neurons to capture and store the input state information (e.g., the features of an image), allowing them to generalize over a large state space and provide a complex mapping between states and actions. Hence, DRL addresses the curse of dimensionality issue of RL caused by the large number of state and action mappings in the tabular format. DRL uses two deep neural networks, namely the main and target networks, to estimate Q-values and target Q-values, respectively, which are used for estimating the loss and updating the weights of the main network during training.
Why (deep) reinforcement learning? Using (D)RL in sentiment analysis has four main advantages. First, (D)RL reduces the need for the large amount of manually labeled data used in training supervised machine learning approaches. The data consists of a large number of features to capture the large variance of language expressions. An insufficient amount of manually labeled data has been shown to be the root cause of poor generalization and overfitting (Wang et al. 2021). Unfortunately, collecting and labeling a large amount of data manually has two main disadvantages because these tasks are: (a) costly, particularly when the granularity level is lower, whereby the word level is costlier and requires more supervision than the document level; and (b) prone to human subjectivity and misinterpretation. Second, (D)RL reduces the need for large linguistic resources, such as dictionaries, to understand sentiment ambiguity, as required by lexicon-based approaches (Obiedat et al. 2022). Third, (D)RL is capable of processing sentences with a diverse range of lengths (Keerthana et al. 2021), and this has been achieved through identifying aspects (e.g., the characteristics of a product or service in a review) and segregating texts into logical partitions (e.g., words and phrases) (Wang et al. 2019), such as the enhancement shown in the machine learning model called seq2seq (Vinyals and Le 2015; Luong et al. 2015). Fourth, (D)RL preserves the context and order of logical partitions (e.g., words and sentences), while existing machine learning approaches tend to encode sentences into fixed-length embeddings that cannot preserve the order and context of logical partitions (Pröllochs et al. 2020).
1.1 Our contributions
This paper reviews the enhanced approaches of RL and DRL in sentiment analysis. The key contributions are as follows:
-
This paper reviews the main aspects of RL and DRL in sentiment analysis, including objectives, applications, performance metrics, datasets, and algorithms, offering a thorough understanding of their implementations.
-
This paper highlights and discusses open issues of RL and DRL in sentiment analysis, stimulating further research and investigation. These open issues provide a roadmap for future research directions and opportunities.
-
To the best of our knowledge, this is the first review paper that specifically addresses the use of RL and DRL in sentiment analysis, filling a significant gap in the existing literature.
1.2 Organization of this paper
The rest of this paper is organized as follows. Section 2 presents an overview of RL and DRL. Section 3 outlines the objectives, applications, and performance metrics used in RL and DRL for sentiment analysis. Section 4 presents the datasets, pre-processes, implementation details, and hyper-parameters for RL and DRL in sentiment analysis. Section 5 reviews the RL approaches in sentiment analysis and summarizes their enhancements and attributes. Section 6 presents open issues and future directions. Finally, Sect. 7 concludes this paper. The acronyms and notations used in this paper are summarized in Tables 1 and 2, respectively.
1.3 Literature searching process
There has been a notable increase in research interest in applying RL and DRL to sentiment analysis over the years. Figure 1 shows the distribution of related papers published at different times. This distribution highlights the increasing interest and research activities in the enhancement of RL and DRL in sentiment analysis over the past 20 years. By examining the publication trends, we can identify the advancements and shifts in research focus. This trend analysis also highlights the need for a comprehensive review at this point in time, as the field continues to evolve and expand.
The literature searching process consists of four main stages: identification, screening, eligibility assessment, and categorization of potential research articles. The PRISMA diagram (Page et al. 2021) was used to systematically document this process, with Fig. 2 illustrating each stage of the literature search.
In the identification phase, we executed queries in Scopus and Web of Science (WOS) to gather articles from sources known for their thorough coverage of related literature (Table 3). These databases were chosen for their rich content and inclusion of highly reputable journals (Singh et al. 2021). The query was restricted to the years 2018 to 2024 to provide extensive coverage of sentiment analysis and artificial intelligence topics within these years. Initially, we identified 484 papers from our search results.
This review aims to provide thorough and detailed coverage of RL and DRL in sentiment analysis within the domain of NLP. To ensure the relevance and quality of the studies included in this review, specific inclusion and exclusion criteria were established in the screening phase, as shown in Table 4. This process included reviewing titles and abstracts to eliminate unrelated studies. During the eligibility phase, we assessed full-text articles to confirm their compliance with our inclusion criteria. Finally, the categorization phase entailed selecting the final articles for review.
2 Background
This section provides the essential background information on RL and DRL, which is crucial for understanding the subsequent sections on RL applications in sentiment analysis. To begin, this section introduces the fundamental concepts of RL and the Markov decision process (MDP), which forms the basis for RL algorithms. Section 2.1 discusses key RL approaches, including Q-learning, and emphasizes the balance between exploration and exploitation. Section 2.2 introduces REINFORCE, a significant policy gradient algorithm. Finally, Sect. 2.3 presents the formulation of the DRL approach, represented by the deep Q-network (DQN).
2.1 Reinforcement learning
Reinforcement learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with its environment. The agent’s goal is to identify the optimal policy, which is a mapping between states and actions that maximizes cumulative rewards. RL is used to solve Markov decision process (MDP) problems, which involve a sequential decision-making process where outcomes are partly random and partly under the control of the agent (Sutton et al. 1999). Figure 3 shows the abstract RL model. The agent explores different actions \(a_t \in A\) under different states \(s_t \in S\), and receives rewards \(r_{t+1}(s_{t+1})\). Through online trial-and-error interactions with the operating environment, the agent learns the optimal policy \(\pi _t^*(s_t)\) with exploration and exploitation processes.
Q-learning is a popular RL approach that uses the Q-function to estimate the Q-value \(Q_t(s_t,a_t)\) of a state-action pair. The Q-value represents the expected cumulative reward of taking action \(a_t\) in state \(s_t\) and following the optimal policy thereafter. The Q-function is given by:
$$\begin{aligned} Q_{t+1}(s_t,a_t) = Q_t(s_t,a_t) + \alpha \left[ r_{t+1}(s_{t+1}) + \gamma \max _{a \in A} Q_{t}(s_{t+1},a) - Q_{t}(s_t,a_t) \right] \end{aligned}$$(1)
where \(0 \le \alpha \le 1\) represents the learning rate, which determines the level of changes of Q-values in RL, and \(0 \le \gamma \le 1\) represents the discount factor, which determines the effect of the long-term reward. The term \(r_{t+1}(s_{t+1}) + \gamma \max _{a \in A} Q_{t}(s_{t+1},a)\) represents the target value used in the Q-learning update rule. This target value is composed of two components, namely the immediate reward \(r_{t+1} (s_{t+1})\) received after taking action \(a_t\) in state \(s_t\), and the best possible future reward \(\gamma \max _{a \in A} Q_{t}(s_{t+1},a)\) expected from the next state \(s_{t+1}\). The update rule adjusts the current Q-value \(Q_t(s_t,a_t)\) towards this target value to improve the agent’s estimation of future rewards.
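As a concrete illustration, the following minimal sketch implements the tabular Q-learning update in Python; the toy problem sizes, reward value, and hyper-parameter settings are illustrative assumptions rather than values taken from any of the reviewed papers.

```python
import numpy as np

# Illustrative sizes: a toy problem with 5 states and 3 actions.
N_STATES, N_ACTIONS = 5, 3
ALPHA, GAMMA = 0.1, 0.9               # learning rate and discount factor

Q = np.zeros((N_STATES, N_ACTIONS))   # tabular Q-values Q(s, a)

def q_update(s_t, a_t, r_next, s_next):
    """One Q-learning step: move Q(s_t, a_t) towards the target value."""
    target = r_next + GAMMA * np.max(Q[s_next])    # r + gamma * max_a Q(s', a)
    Q[s_t, a_t] += ALPHA * (target - Q[s_t, a_t])  # temporal-difference update

# Example transition: in state 0, action 2 yields reward +1 and next state 1.
q_update(s_t=0, a_t=2, r_next=1.0, s_next=1)
```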
During exploitation, the agent selects the optimal action \(a^*_{t}\) for a given state \(s_t\) using the optimal policy \(\pi ^*(s_t)\), which is determined by:
$$\begin{aligned} a^*_t = \pi ^*(s_t) = \mathop {\arg \max }\limits _{a \in A} Q_t(s_t,a) \end{aligned}$$(2)
where \({\arg } {\max }_{a}\) returns the action a that maximizes the Q-value \(Q_t(s_t,a)\).
As time goes by, the agent learns the optimal policy that maximizes the state value, which is the cumulative reward starting from state \(s_t\). The state value function \(V_{\pi _t}(s_t)\) is defined as:
$$\begin{aligned} V_{\pi _t}(s_t) = {\mathbb {E}}_{\pi _t}\left[ \sum _{k=0}^{\infty } \gamma ^{k} r_{t+k+1}(s_{t+k+1}) \right] \end{aligned}$$(3)
This equation represents the expected cumulative reward, where future rewards are discounted by the discount factor \(\gamma\). Due to the recursive nature of the state value function, it can also be expressed as:
$$\begin{aligned} V_{\pi _t}(s_t) = {\mathbb {E}}_{\pi _t}\left[ r_{t+1}(s_{t+1}) + \gamma V_{\pi _t}(s_{t+1}) \right] \end{aligned}$$(4)
To balance exploration and exploitation, two popular approaches are used: (a) the \(\varepsilon\)-greedy approach explores a random action with probability \(\varepsilon\) and exploits the action with the highest Q-value with probability \(1-\varepsilon\); and (b) the softmax approach explores random actions more often when the temperature \(\tau\) increases, and exploits actions with higher Q-values more often when \(\tau\) decreases.
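The two exploration strategies can be sketched as follows; the Q-value array and the parameter values are hypothetical examples, not settings from the reviewed papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """Explore a random action with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # exploration
    return int(np.argmax(q_values))               # exploitation

def softmax_action(q_values, tau=0.5):
    """Sample an action from a Boltzmann distribution over the Q-values."""
    z = np.asarray(q_values, dtype=float) / tau
    z -= z.max()                                  # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(q_values), p=probs))

q = [0.2, 0.5, 0.1]
print(epsilon_greedy(q), softmax_action(q, tau=1.0))
```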
2.2 REINFORCE
The REINFORCE algorithm, introduced by Williams in 1992, is a foundational method in policy gradient RL (Williams 1992). It optimizes the policy using two key steps: (a) Monte Carlo sampling, which collects trajectories consisting of sequences of states, actions, and rewards under the current policy and estimates the cumulative reward of each trajectory; and (b) the policy gradient step, which updates the policy parameters to increase the probability of selecting actions that yield higher cumulative rewards. Unlike traditional RL methods that rely on temporal difference learning or bootstrapping, REINFORCE learns directly from complete episodes of actual experience. However, this method can suffer from high variance due to the randomness in the rewards of each trajectory (Sutton et al. 1999).
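A minimal REINFORCE update for a single trajectory is sketched below in PyTorch; the small policy network, state dimensionality, and sample trajectory are our own illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical policy: maps a 4-dimensional state to probabilities over 2 actions.
policy = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2), nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
GAMMA = 0.99

def reinforce_update(states, actions, rewards):
    """One Monte Carlo policy-gradient step over a complete trajectory."""
    # Discounted cumulative rewards (returns), computed backwards from the end.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + GAMMA * g
        returns.insert(0, g)
    returns = torch.tensor(returns)

    probs = policy(torch.stack(states))                           # (T, n_actions)
    chosen = probs.gather(1, torch.tensor(actions).unsqueeze(1)).squeeze(1)
    loss = -(torch.log(chosen) * returns).sum()                   # maximize expected return

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Hypothetical trajectory of three decision epochs.
states  = [torch.randn(4) for _ in range(3)]
actions = [0, 1, 0]
rewards = [0.0, 0.0, 1.0]
reinforce_update(states, actions, rewards)
```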
2.3 Deep reinforcement learning
Deep Q-network (DQN) has two artificial neural networks (ANNs), namely the main network and the target network as shown in Fig. 4 (Mnih et al. 2015). Each network has multiple hidden layers, and each layer consists of multiple neurons that capture and store knowledge. Each neuron uses different network parameters (e.g., weight W) to define the contributions of the neuron to the next layer. The output layer produces Q-values.
During action selection, the input signals traverse the main network through the input, hidden, and output layers. The optimal action \(a^*_t\) is selected based on the estimated Q-values \(Q_t(s_t,a_t)\) provided by the neurons in the output layer. The state \(s_t\), action \(a_t\), delayed reward \(r_{t+1}(s_{t+1})\), and the next state \(s_{t+1}\) of a decision epoch \(t={1,2,\dots ,T}\) are stored as an experience tuple \(e_t = (s_t,a_t,r_{t+1}(s_{t+1}),s_{t+1})\) in the replay memory for training purposes. During training, the agent calculates the loss (or error) that compares the expected outputs from the target network and the real outputs from the main network. The loss, L, in the form of gradient values, traverses the main network through the output, hidden, and input layers using backpropagation. With gradient descent, the weights (or network parameters) over the links of the main network are adjusted to minimize the loss. Algorithm 1 presents the DQN algorithm adapted from Mnih et al. (2015).
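The loss computation between the main and target networks can be sketched as follows; the network sizes, the randomly generated mini-batch, and the hyper-parameters are illustrative assumptions, and replay-memory sampling is omitted for brevity.

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 8, 4, 0.99        # hypothetical dimensions

def make_net():
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

main_net, target_net = make_net(), make_net()
target_net.load_state_dict(main_net.state_dict())    # synchronize at the start
optimizer = torch.optim.Adam(main_net.parameters(), lr=1e-3)

def dqn_loss(states, actions, rewards, next_states, dones):
    """Mean-squared error between main-network Q-values and target Q-values."""
    q = main_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                             # the target network is not trained directly
        q_next = target_net(next_states).max(dim=1).values
        target = rewards + GAMMA * q_next * (1 - dones)
    return nn.functional.mse_loss(q, target)

# One training step on a mini-batch of experience tuples (illustrative shapes only).
batch = (torch.randn(32, STATE_DIM), torch.randint(0, N_ACTIONS, (32,)),
         torch.randn(32), torch.randn(32, STATE_DIM), torch.zeros(32))
loss = dqn_loss(*batch)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```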
3 Attributes of reinforcement learning in sentiment analysis
This section introduces and discusses the key attributes of RL approaches in sentiment analysis, as illustrated in Fig. 5. These attributes include: (a) objectives to clarify the goals and significance of RL approaches; (b) applications to demonstrate the practical uses; and (c) performance metrics to assess the effectiveness of RL approaches. The subsequent paragraphs provide a detailed explanation of these attributes.
3.1 Objectives of sentiment analysis
Sentiment analysis in NLP aims to interpret opinions in text data, such as customer reviews. However, the traditional sentiment analysis often faces challenges such as: (a) requiring a large labeled dataset; and (b) failing to capture nuanced sentiment polarities for specific aspects of a sentence. The enhanced RL approaches address the challenges in sentiment analysis while achieving the following objectives:
-
O.1
Identifying logical partitions segregates texts (e.g., customer reviews) into logical partitions at different levels of granularity, such as aspect (i.e., the characteristic of a product, such as the taste of a meal), entity (e.g., brand and product), emotion (e.g., joy and sadness), word, phrase, sentence, and document. The aspect level is popularly used in sentiment analysis using RL. An aspect is either a substantial object (e.g., car) or a conceptual object (e.g., service), and it consists of either a single word (e.g., car) or multiple words (e.g., car interior). For example, as shown in Fig. 6, in the sentence “I like the car interior, but I do not like the exterior.”, the “car interior” has a positive sentiment, and the “car exterior” has a negative sentiment (Wang et al. 2019).
-
O.2
Classifying sentiments (or emotions) of logical partitions into categories. Sentiment polarity classifies the sentiments of logical partitions into either positive, neutral, or negative, as shown in Table 5. Scores can represent the intensity of the sentiments of logical partitions. The score can be either a positive (\(>0\)), a neutral (0), or a negative (\(<0\)) value. The score is adjusted by valence shifters, including negation words (e.g., “unhappy”), amplifying words (e.g., “very happy”), and deamplifying words (e.g., “quite happy”). RL can classify sentiments in real time, so it addresses the dynamics of sentiment polarity. For instance, customers’ sentiments may change as they browse through products online, so sellers (or their chatbots) should understand the customers’ experience and determine the customers’ intention to purchase (Eshak et al. 2017) to decide the next step over time (Zhang et al. 2019a; Wang et al. 2021; Li et al. 2018; Pröllochs et al. 2020).
-
O.3
Adjusting responses provides responses (outputs) with the right sentiments, such as chatbots making a conversation more personal and friendly. The purpose is to change the polarity p from a positive sentiment to a negative one, and then change it back to positive to match the original sentiment (Lee et al. 2018; Keerthana et al. 2021).
-
O.4
Extracting correct sentiment words for aspects, in which each sentiment word may represent a different sentiment polarity for the relevant aspects. For instance, the aspect is use and its sentiment word is ease in the “... the ease of use” clause, and the aspect is price tag and its sentiment word is well worth in the “... make it well worth the price tag” clause (Dai et al. 2022). A re-reading process refines the sentiment word from popular to so popular and removes nightmare for the food aspect in the “... food is so good and so popular that waiting can really be a nightmare” clause (Yang et al. 2019).
-
O.5
Extracting aspects from unlabeled data is a fine-grained task useful in classifying the sentiment polarities of aspects. For instance, the aspects are mushroom soup and coleslaw in the “I recommend the mushroom soup and the coleslaw as starters” clause (Venugopalan and Gupta 2022).
3.2 Applications
Sentiment analysis based on RL has been applied in various applications. Being the points of interaction with users, the applications serve as data sources providing essential data for training RL in sentiment analysis. The applications are as follows:
-
A.1
Social networks, such as Twitter (Sushmitha et al. 2022), support electronic commerce (e-commerce) and social commerce (or s-commerce) (Eshak et al. 2017). Social networks allow online communities to interact with each other and meet their own social goals. Data can be extracted and collected from social networks using application programming interface (API) features, such as querying, streaming, collecting, and searching for related user comments (Sushmitha et al. 2022). Social networks are popular data sources for customer reviews, as sellers can sell products, and customers can compare and purchase products and share information and comments on the platforms (Yang et al. 2019; Wang et al. 2021).
-
A.2
Electronic commerce (e-commerce) websites, such as Amazon (Nagamanjula and Pethalakshmi 2018) and Google (Kumar 2023), allow users to provide ratings (e.g., on a scale of 1 to 5) and comments (Venugopalan and Gupta 2022; Wang et al. 2021; Dai et al. 2022). Data can be extracted and collected using web scraping tools, such as Python libraries (e.g., BeautifulSoup (Uzun et al. 2018)) and methods (e.g., “requests.get()” (Chandra and Varanasi 2015)); a brief scraping sketch is given after this list.
-
A.3
Chatbots allow customers to communicate with sellers about their products or services (Lee et al. 2018; Keerthana et al. 2021).
-
A.4
Speech, particularly recorded speech, allows users to share their opinions, thoughts, emotions, judgments, and so on, about different topics (e.g., financial, movie, marketing) through conversation (Zhang et al. 2019a). Data can be extracted and collected as audio and visual features.
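As referenced in A.2, a minimal review-scraping sketch with requests and BeautifulSoup is shown below; the URL and the HTML structure (a `review-text` class) are hypothetical, and real e-commerce sites typically require API access and compliance with their terms of service.

```python
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/product/123/reviews"   # hypothetical review page

response = requests.get(URL, timeout=10)          # fetch the raw HTML
soup = BeautifulSoup(response.text, "html.parser")

# Hypothetical markup: each review sits in an element with class "review-text".
reviews = [el.get_text(strip=True) for el in soup.find_all(class_="review-text")]
print(f"Collected {len(reviews)} reviews")
```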
3.3 Performance Metrics
There are six main performance metrics as follows:
-
P.1
Higher prediction performance, including: (a) a lower mean absolute error (MAE) that measures the average difference between the predicted (or estimated) and actual (or ground truth) sentiment polarities without considering the positive and negative directions of the values; (b) a lower mean squared error (MSE) that measures the average of the squared differences between the predicted and actual sentiment polarities; (c) a higher accuracy that measures the ratio of the correctly predicted sentiment polarities to the total number of predictions made; (d) a higher bilingual evaluation understudy \({\textit{BLEU}}=n_r/n_a\) that indicates a higher similarity between the predicted and actual outputs, where \(n_a\) represents the number of words in the real output and \(n_r\) represents the number of similar words between the predicted and actual outputs (Pröllochs et al. 2020; Wang et al. 2019; Keerthana et al. 2021); and (e) a lower negative log-likelihood (NLL) that indicates a higher capability in generating text with more accurate emotions and context (Li et al. 2018).
-
P.2
Higher classification performance (Venugopalan and Gupta 2022; Li et al. 2018; Yang et al. 2019; Zhang et al. 2019a; Wang et al. 2021; Dai et al. 2022), including a higher precision \(P=TP/(TP+FP)\), recall \(R=TP/(TP+FN)\), accuracy \(A=(TP+TN)/(TP+TN+FP+FN)\), and f-measure \(F=2 \times P \times R/(P+R)\) (Eshak et al. 2017; Nagamanjula and Pethalakshmi 2018), where TP represents true positive, TN represents true negative, FP represents false positive, and FN represents false negative. For an imbalanced dataset, the geometric mean (or G-mean) \(\sqrt{{\textit{sensitivity}} \times {\textit{specificity}}}\) is used (a minimal computation sketch of these metrics is given after this list), specifically:
$$\begin{aligned} G\text {-mean} = \sqrt{\frac{TP}{TP+FN} \times \frac{TN}{TN + FP}} \end{aligned}$$(5)
-
P.3
Higher performance based on baseline models compares the performances achieved by baseline and proposed models. For example, the seq2seq (Lee et al. 2018; Vinyals and Le 2015; Luong et al. 2015) and recurrent neural network (RNN) discriminators are compared in terms of: (a) response accuracy (i.e., the similarity of semantics between inputs and outputs); (b) the positive sentiment level of the output; and (c) the accuracy of the output grammar.
-
P.4
Higher performance based on human evaluation invites human subjects to perform subjective human evaluation on the response outputs. Examples of questions are (Lee et al. 2018): (a) whether the response output is a good response to the input sentence (coherence-related); (b) whether the response output is positive (sentiment-related); and (c) whether the response output is grammatically correct (grammar-related).
-
P.5
Higher RL performance includes a higher cumulative reward that helps to induce a positive human sentiment (Keerthana et al. 2021).
-
P.6
Smaller number of selected data for sentiment analysis helps to reduce the annotation cost of time and effort required to label data for sentiment analysis (Venugopalan and Gupta 2022).
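As mentioned in P.2, the classification metrics can be computed directly from the confusion-matrix counts as sketched below; the counts used in the example are made up.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Precision, recall, accuracy, f-measure, and G-mean from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)                      # also called sensitivity
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    g_mean = math.sqrt(recall * specificity)        # suited to imbalanced datasets
    return precision, recall, accuracy, f_measure, g_mean

# Hypothetical counts for a binary sentiment classifier.
print(classification_metrics(tp=80, tn=60, fp=10, fn=20))
```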
4 Essentials for training RL in sentiment analysis
This section introduces and discusses the essentials for training RL in sentiment analysis. These essentials include: (a) datasets to ensure the validity and generalizability of the training models; (b) pre-processors to prepare the raw data for analysis by performing tasks such as text normalization and tokenization; (c) implementation details to understand the architecture or environment in use; and (d) parameters and values to optimize the performance of the RL models. The subsequent paragraphs provide a detailed explanation of these attributes, emphasizing their importance and role in the training process.
4.1 Datasets
RL uses opinion and review datasets for training in sentiment analysis. Table 6 presents the datasets used to train RL in sentiment analysis. D.1–2 are recorded as video, while D.3–13 are represented as text. Opinion datasets (D.1–2, D.7–9, D.13) cover the conversation of two or more individuals about their point of view on various topics. The news dataset (D.12) covers news articles reporting the world’s incidents. Review datasets (D.3–6, D.10–11, D.14) cover the user feedback of restaurants, electrical products, hotels, and so on.
Datasets used in sentiment analysis are generally raw and unstructured. Pre-processing handles the inconsistent and incomplete data, including modifying and organizing relevant data and removing noisy/irrelevant data. There are seven main pre-processes to generate clean data, including processes to: (a) tokenize text into smaller tokens (e.g., words) based on punctuation marks and white spaces; (b) remove stop words and punctuation marks; (c) replace contractions (e.g., “he’ll” is replaced by “he will” and “I’m” is replaced by “I am”); (d) remove special characters, hashtags, and user identification; (e) change emoticons into words for interpretation; (f) correct spelling errors and address abbreviations; and (g) remove affixes.
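A minimal sketch of several of these cleaning steps is shown below; the contraction map is a small illustrative subset, the regular expressions are simplifications, and the reviewed papers may rely on different packages (see Table 8).

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

CONTRACTIONS = {"he'll": "he will", "i'm": "i am", "isn't": "is not"}   # illustrative subset
STOP_WORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean(text):
    text = text.lower()
    for short, full in CONTRACTIONS.items():             # (c) replace contractions
        text = text.replace(short, full)
    text = re.sub(r"[#@]\w+", " ", text)                 # (d) drop hashtags and user mentions
    tokens = re.findall(r"[a-z]+", text)                 # (a)/(b) tokenize and drop punctuation
    tokens = [t for t in tokens if t not in STOP_WORDS]  # (b) remove stop words
    return [stemmer.stem(t) for t in tokens]             # (g) remove affixes

print(clean("I'm not happy with @seller, the #delivery isn't great!"))
```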
4.2 Pre-processes
To further pre-process the clean data, six main pre-processing approaches are used, as presented in Table 7. Different kinds of Python and R packages used in these processes are listed in Table 8.
The input of the RL agent, particularly the state, is encoded. GloVe (Pennington et al. 2014) creates word embeddings, that is, it transforms words into numerical representations. This helps to capture the semantics of and relationships between words; e.g., “queen” and “king” will be closer in the vector space as they are often mentioned together. The COVAREP software (Degottex et al. 2014) extracts acoustic (or audio) features from recorded speech. Examples of features include the glottal source parameters (Drugman et al. 2011), maxima dispersion quotients (Kane and Gobl 2013), peak slope parameters, twelve mel-frequency cepstral coefficients, pitch, and voiced/unvoiced segmenting features (Drugman and Alwan 2019). P2FA (Yuan and Liberman 2008) performs temporal alignment between text and audio at the word level.
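A minimal sketch of loading pre-trained GloVe vectors and comparing two words is shown below; the file name corresponds to the publicly distributed 300-dimensional GloVe file (which must be downloaded beforehand), and the cosine-similarity helper is our own illustration.

```python
import numpy as np

def load_glove(path, vocab=None):
    """Read GloVe text vectors into a {word: vector} dictionary."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            if vocab is None or word in vocab:
                embeddings[word] = np.asarray(values, dtype=np.float32)
    return embeddings

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Assumes the pre-trained 300-dimensional vectors have been downloaded locally.
glove = load_glove("glove.840B.300d.txt", vocab={"king", "queen", "car"})
print(cosine(glove["king"], glove["queen"]))   # semantically close words score higher
print(cosine(glove["king"], glove["car"]))
```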
4.3 Implementation details
Details about the implementation of RL approaches in sentiment analysis are limited in the literature. Before the training of review datasets (Yang et al. 2019), the hyper-parameters can be automatically tuned using the grid search algorithm (Bergstra et al. 2013). During training, data is passed to the RL algorithm locally on GPUs, such as NVIDIA GeForce GTX-TITAN (Li et al. 2018). To avoid overfitting, the early stopping mechanism is applied with a window size of 5 (Yang et al. 2019).
Using the NLTK library in Python, the BLEU score is calculated for evaluating the quality of a sentence (Keerthana et al. 2021). The Numpy library and Matplotlib are used to plot the cumulative reward of RL (Keerthana et al. 2021). With PyTorch-Transformers, BERT embeddings are generated using a stochastic gradient descent optimizer to optimize the base model, specifically the BiLSTM-CRF network (Venugopalan and Gupta 2022). For the self-attention RNN in Venugopalan and Gupta (2022), the Adam optimizer from PyTorch was used. Training was conducted over 5000 epochs to minimize loss and stabilize the gradient.
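A short sketch of the BLEU computation with NLTK and a cumulative-reward plot with Matplotlib is given below; the reference sentence, candidate sentence, and reward values are made-up examples.

```python
import matplotlib.pyplot as plt
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# BLEU between a reference (ground-truth) response and a generated candidate.
reference = [["the", "food", "was", "really", "good"]]
candidate = ["the", "food", "was", "good"]
bleu = sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {bleu:.3f}")

# Cumulative reward across episodes (illustrative values).
rewards = [0.1, 0.4, 0.3, 0.7, 0.9]
cumulative = [sum(rewards[: i + 1]) for i in range(len(rewards))]
plt.plot(cumulative)
plt.xlabel("Episode")
plt.ylabel("Cumulative reward")
plt.savefig("cumulative_reward.png")
```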
4.4 Hyper-parameters and values
During pre-processing, the dimensionality of the word vectors is a hyper-parameter of the GloVe algorithm. The higher the dimension of the vector for each word, the richer the semantic representations (e.g., meaning and context) of the word. A 300-dimensional GloVe vector is pre-trained on a large corpus (Yang et al. 2019), such as 840 billion words (Dai et al. 2022). In the top classification layer, a 1024-dimensional vector is used to enhance the semantic connection (Li et al. 2018).
During the RL process, there are three main hyper-parameters being carefully considered in each approach as follows:
-
H.1
Number of layers in the neural network balances model complexity and computational efficiency. Depending on the network, a sufficient depth can capture complex patterns in the data while avoiding overfitting. The number of layers is set to 3 for the graph neural network in Dai et al. (2022), 2 for the policy network in Wang et al. (2021), and 2 for the neural network in Keerthana et al. (2021).
-
H.2
Learning rate \((0 \le \alpha \le 1)\) determines the step size of the Q-value update in RL (or the network parameter update in DRL). In sentiment analysis, a lower learning rate decreases the number of sentences for a particular domain in a dataset for training. In Venugopalan and Gupta (2022), a low learning rate of \(\alpha =0.0001\) is set for the RNN agent. This ensures consistent sentiment predictions and minimizes drastic changes.
-
H.3
Temperature \((0 \le \tau \le 1)\) determines the degree of exploration in softmax action selection in RL. A higher temperature increases exploration by promoting random actions. In Li et al. (2018), a high temperature, i.e., \(\tau =1\), is set to initialize the training. As \(\tau\) decreases over time, the action with the highest estimated Q-value is chosen more often (Wang et al. 2021; Yang et al. 2019).
5 Reinforcement learning approaches in sentiment analysis
RL has emerged as a useful technique for sentiment analysis, such as identifying grammar logic and sentiment polarities, adjusting responses, and extracting context. This section presents a review of: (a) RL and DRL approaches, including RL with baseline models, RL with episodic rewards, DRL with LSTM, ANN, actor-critic RL (ACRL), and Q-value distributions; and (b) their enhancements, including DRL with dependency graphs and DRL with multiplex heterogeneous graph. Table 9 presents a summary of the RL approaches and enhancements in sentiment analysis, highlighting their advantages that motivate this study.
5.1 Traditional reinforcement learning with baseline models
In Lee et al. (2018), RL enables a chatbot (A.3) to provide responses with the right sentiments (O.3) using the textual feature at the sentence level. The challenge is to make responses more personal and friendly.
The agent extends the seq2seq model to generate meaningful responses with the right sentiments, which are based on the context of a conversation, given a user input (Vinyals and Le 2015; Luong et al. 2015). The state \(s_t\) represents an input sentence. The action \(a_t\) represents a response output to the input sentence. The reward function \(r_{t+1}(s_{t+1})\) uses three baseline models to generate three metrics representing semantic coherence, so that the generated response output is similar to the reference output. The first metric is the probability of the response output \(a_t\) given the input sentence \(s_t\) under the seq2seq model. The second is a score that represents whether the input sentence \(s_t\) and the response output \(a_t\) form a coherent dialogue pair. An RNN discriminator uses two RNN encoders to encode the input sentence \(s_t\) and the response output \(a_t\) as embeddings, concatenates them, and feeds them into a fully connected layer to generate the score. The third is the sentiment score of the response output \(a_t\), which represents how positive the response output is. A sentiment classifier, which is trained using a corpus of sentences with labeled sentiments, determines the sentiment score of the response output.
The model is trained using the Marsan-Ma’s Twitter chatting corpus dataset (D.7). The proposed solution has been shown to achieve a higher performance. First, based on human evaluation (P.4), in which 30 human subjects provide subjective responses on whether the response outputs have the right coherence, sentiment, and grammar. Second, based on baseline models (P.3), including a higher probability or score based on the seq2seq model, RNN discriminator, and language model (Mikolov et al. 2010).
Traditional models like seq2seq and plug and play models (Nguyen et al. 2017) have been widely used to analyze sentiment and generate responses. However, these models often struggle to balance adjusting responses and preserving logical semantics. They either generate sentences that are semantically wrong, or they preserve the semantics but sound less personal and unfriendly (e.g., robotic). This challenge highlights an ongoing issue in natural language processing, namely the trade-off between generating diverse responses and maintaining sentence quality. By integrating baseline models into the reward function of RL, this challenge is addressed. The RL agent learns to: (a) align the output’s semantics with the baseline seq2seq model; (b) match the semantic structure of the baseline RNN model; and (c) achieve a high sentiment score based on the baseline sentiment classifier. Through this approach, the RL agent can generate more varied and contextually appropriate responses.
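A schematic of how the three baseline signals could be combined into a single reward is sketched below; the weighting scheme and the placeholder scores are our own illustration, not the exact formulation of Lee et al. (2018).

```python
def chatbot_reward(seq2seq_logprob, coherence_score, sentiment_score,
                   w_semantic=1.0, w_coherence=1.0, w_sentiment=1.0):
    """Combine the three baseline signals into one scalar reward (illustrative weights).

    seq2seq_logprob : log-probability of the response under the baseline seq2seq model
    coherence_score : RNN-discriminator score for the (input, response) dialogue pair
    sentiment_score : classifier score for how positive the response is
    """
    return (w_semantic * seq2seq_logprob
            + w_coherence * coherence_score
            + w_sentiment * sentiment_score)

# Hypothetical scores produced by the three baseline models for one response.
print(chatbot_reward(seq2seq_logprob=-2.3, coherence_score=0.8, sentiment_score=0.9))
```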
5.2 Traditional reinforcement learning with episodic rewards
In Pröllochs et al. (2020), RL detects and interprets negations to classify sentiments (O.2) using the textual feature at the document level. Negations, which are expressed using explicit words (e.g., “not”) and implicit words (e.g., “forbid”), reverse the meaning of a word, phrase, or sentence. Each document, which is a review with single or multiple sentences, has a single real sentiment polarity (i.e., the ground truth). Words in the document have meanings based on their own context, established by their arrangements (or orders) and dependence on other words in the document. The challenge is that negations are subjective across different humans due to their latent and unobservable nature.
In traditional RL, the agent receives rewards immediately after classifying sentiment polarity. This instant feedback can cause the agent to misinterpret the sentiment of sentences with negations, as it does not fully consider the context. In contrast, episodic rewards delay the feedback until the end of a document. The agent reviews the entire context before making its final sentiment classification. This approach addresses the challenge of inaccurate immediate rewards, which might overlook crucial words later in the sentence that can change the sentiment polarity.
The agent processes the sequence of words in a document word-by-word and selects whether or not to negate the sentiment polarity of the document. As shown in Fig. 7, the state \(s_t\) represents a vector of: (a) the current word at decision epoch t; and (b) the previous action at decision epoch \(t-1\). So, the state has a recurrent nature because of the historical information. The action \(a_t\) is whether or not to negate the sentiment polarity of the document at decision epoch t. Subsequent words may or may not negate the sentiment polarity of the document at decision epochs \(t+1, t+2,\ldots , T\) until the end of the document at decision epoch T. The reward \(r_T(s_T)\), which is received at the terminal decision epoch T, represents the expected error or loss between the actual (or estimated) and the real (or the ground truth) sentiment polarities of the document, given that the real sentiment polarities are labeled by humans representing human perceptions. As an example, in the “this fancy product isn’t good but fantastic” sentence, the “this” word initializes the sentiment polarity of the document to positive. The “isn’t” word negates the sentiment polarity to negative, and then the “but” word negates the sentiment polarity to positive, giving an overall positive sentiment polarity for the document.
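The word-by-word negation MDP with an episodic reward can be sketched as follows; the word-level prior polarities and the rule standing in for the learned policy are simplified placeholders, and the reward is only issued once the whole document has been read.

```python
# Illustrative word-level prior polarities and negation triggers (not learned here).
PRIOR = {"good": 1, "fantastic": 1, "bad": -1}
NEGATE = {"isn't", "not", "but"}          # words for which the agent chooses to negate

def classify_document(words, true_polarity):
    polarity, negated = 1, False
    for word in words:                    # one decision epoch per word
        action_negate = word in NEGATE    # placeholder for the learned policy pi(s_t)
        if action_negate:
            negated = not negated
        if word in PRIOR:
            polarity = -PRIOR[word] if negated else PRIOR[word]
    reward = 1.0 if polarity == true_polarity else -1.0   # episodic reward at epoch T
    return polarity, reward

words = "this fancy product isn't good but fantastic".split()
print(classify_document(words, true_polarity=1))   # positive overall, as in the example above
```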
The model is trained using the IMDb dataset (D.10). The proposed solution has been shown to achieve a higher prediction performance (P.1) with a higher similarity between actual and real sentiment polarities of documents.
This study has proven that episodic rewards are capable of accounting for long-term dependencies and improving the accuracy of classification. Nonetheless, challenges remain. For example, RL struggles when the distribution of negations within documents is uneven. As shown in this study, the second half of movie reviews and financial news contains more negations than the first half. This inconsistency hinders RL models from consistently applying negation detection and sentiment analysis throughout a document. Therefore, further research is needed to refine these models for better capturing the nuances of natural language. Future work could involve developing a model that first differentiates the mathematical representations of various document types, and then identifies the relevant type before scanning it. This approach would help the model determine how frequently to perform negation detection and sentiment analysis within a document. Meanwhile, it reduces the reliance on large, manually labeled datasets for negation.
5.3 Actor-critic reinforcement learning
In Venugopalan and Gupta (2022), ACRL selects a minimal subset of data into a bag of informative data to represent the entire set of unlabeled training data (O.5). This helps to reduce the annotation cost of time and effort required to label data for sentiment analysis. The challenge is that the level of informativeness of the data in the bag is given by the set of data in the bag, which may change as new data is added into the bag over time.
Using the actor-critic approach, the agent has two components: (a) the actor selects actions using the policy; and (b) the critic evaluates the selected action using rewards. A self-attention RNN (Pal et al. 2020), which is capable of learning long sequences of data by enabling the inputs to interact with each other, is used to implement the actor-critic model for training. The state \(s_t\) represents an input sentence. Given the input sentence, the action is \(a_t = \{{\textit{add}}, {\textit{discard}}, {\textit{end}}\}\), where \({\textit{add}}\) adds the sentence into the bag of informative data, \({\textit{discard}}\) does not, and \({\textit{end}}\) marks the end of an episode so that the agent is reinitialized. The reward \(r_{t+1}(s_{t+1})\) has four parameters, and a higher reward indicates: (a) a larger semantic difference given by the average word mover distance between sentences in the bag (Kusner et al. 2015); (b) a larger syntactic (sentence structure) difference (or the tree kernel difference, also called the parse tree difference) given by the average distance between sentences in the bag (Moschitti 2006); (c) a lower penalty for a smaller size of the bag to encourage a minimal subset; and (d) a higher bonus, which is based on the current action and the number of actions taken before it. A higher reward value indicates a higher chance of the input sentence being selected into the bag.
Figure 8 shows the abstract model of ACRL. The actor network receives the state and uses a forward pass to generate two outputs: (a) the probabilities of all possible actions being selected; and (b) the state value. The actor network selects the action based on the probabilities. Given the current state and the selected action, the environment calculates the reward. The critic network receives the reward and updates its state values. The agent selects \({\textit{end}}\) to end an episode, and then calculates and backpropagates loss back into the actor and critic networks.
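A compressed actor-critic update is sketched below in PyTorch; the sentence encoder is replaced by a pre-computed sentence embedding, and the network sizes and advantage formulation are our own simplifications rather than the exact self-attention RNN architecture of Venugopalan and Gupta (2022).

```python
import torch
import torch.nn as nn

EMB_DIM, N_ACTIONS = 128, 3               # actions: add, discard, end (illustrative sizes)
actor  = nn.Sequential(nn.Linear(EMB_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
critic = nn.Sequential(nn.Linear(EMB_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def actor_critic_step(state_emb, reward):
    """One update: the critic estimates the state value, the actor follows the advantage."""
    probs = torch.softmax(actor(state_emb), dim=-1)
    action = torch.multinomial(probs, 1)                   # sample add / discard / end
    value = critic(state_emb).squeeze(-1)
    advantage = reward - value                             # how much better than expected
    actor_loss  = -torch.log(probs[action]) * advantage.detach()
    critic_loss = advantage.pow(2)
    loss = (actor_loss + critic_loss).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return int(action)

# Hypothetical sentence embedding and reward computed by the environment.
print(actor_critic_step(torch.randn(EMB_DIM), reward=torch.tensor(0.6)))
```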
The model is trained using the SemEval-2014 (D.3), SemEval-2015 (D.4), and SemEval-2016 (D.5) datasets, particularly Restaurant-2014, Restaurant-2015, Restaurant-2016, and Laptop-2014, which are comprised of user reviews of restaurants and laptops (A.2). The proposed solution has been shown to achieve a higher classification performance (P.2), particularly f-measure, precision, and recall, and also smaller datasets required for training (P.6).
In Li et al. (2018), the ACRL approach is utilised. It aims to train the generator to produce text sequences that fool the discriminator and align well with the specified categories (O.2). The generator learns to adjust its policy to maximise the reward signal received from the critic, ultimately enhancing the quality and category alignment of the generated text. In the context provided in Li et al. (2018), the actor represents the text generator, while the critic is a descriptor system including both a discriminator and a classifier. The discriminator assesses the generated text for realism or authenticity, distinguishing between generated and real text. The classifier evaluates the category accuracy of the generated sentences, determining whether they align with the desired categories or labels, which typically include sentiments like positive, negative, or neutral.
The actor, which is the generator, selects the next token in the text sequence based on the current state (including the previous tokens generated) and prior information (i.e., the category labels). The critic, which is the discriminator, provides rewards to the generator based on the quality of the selected token. The reward increases with the accuracy of the discriminator in distinguishing between real and synthetic tokens. The classifier evaluates how accurately the generated text matches the specified categories and provides an additional reward.
The model is trained using the Stanford Sentiment Treebank (D.11), NEWS (D.12), and Crowdflower (D.13) datasets, which comprise movie reviews, news, and opinions posted as tweets, respectively. The proposed solution has been shown to achieve a higher classification performance (P.2), particularly f-measure and precision, and a higher prediction performance, particularly in NLL (P.1).
Both actor-critic RL approaches have improved the classification performance in sentiment analysis. For instance, ACRL has been shown to increase the f-measure by 9–17 points compared to random sampling and significantly reduces the data needed for training (Venugopalan and Gupta 2022). While traditional supervised models often require retraining when encountering a new environment, ACRL shows its adaptability in a dynamic environment. Nonetheless, there is space for future research. For example, limited data resources in a certain language can hinder the performance of sentiment analysis. A confidence scoring mechanism can be integrated into actor-critic RL to help the model in determining the need for additional annotations to the data.
5.4 Deep reinforcement learning
In Yang et al. (2019), DRL re-evaluates the learned sentiment words and revises them if they are incorrect (O.4) using the textual feature at the aspect level. This addresses the challenge of mimicking the human reading process, in which humans tend to understand more deeply and correct errors after re-reading a document.
The state \(s_t\) is the word in the sentence. The action \(a_t\) is: (a) whether or not to remove that word from the sentence at decision epoch t in the pre-reading phase; and (b) the prediction of the sentiment polarity of the aspect at decision epoch t in the post-reading phase. The reward \(r_{t+1}(s_{t+1})\) increases with the probability of an accurate prediction of the real sentiment polarities (i.e., the ground truth) of aspects given the estimated sentiment polarity.
As shown in Fig. 9, the weights of the policy network are updated using a loss that reduces with: (a) a higher reward; (b) a lower number of sentiment words maintained in the document to reduce the number of irrelevant words; and (c) a lower coefficient for \(l_2\) regularization used to prevent overfitting.
The model is trained using the SemEval 2014 Task 4 dataset (D.3) comprised of user reviews of restaurants and laptops, and TWITTER (D.6). The proposed solution has been shown to achieve a higher classification performance (P.2), particularly accuracy and f-measure.
5.5 Deep reinforcement learning with long short-term memory
5.5.1 Long short-term memory
Long short-term memory (LSTM) extends RNN to address the traditional vanishing and exploding gradient problems during backpropagation (Hochreiter and Schmidhuber 1997). LSTM processes sequential data and predicts the next element in a sequence while taking contextual information (e.g., word order) into consideration. LSTM has been particularly useful in natural language processing (Aftab et al. 2021), speech recognition, and translation. LSTM has been used to extract the important aspects of a sentence and classify the sentiment polarity (Heryadi et al. 2023; Sushmitha et al. 2022).
As shown in Fig. 10, LSTM uses a gating mechanism that consists of: (a) an input gate that passes information through a sigmoid function to decide which information should update the cell state; (b) a forget gate that passes information through a sigmoid function to decide which information should be kept or forgotten; and (c) an output gate that passes information through a tanh function to decide the next hidden state. Compared to the traditional LSTM model, proposed models in sentiment analysis typically have more layers: (a) multiple input layers and embedding layers convert different logical partitions of texts at different levels representing customer feedback (e.g., sentences and aspects (Heryadi et al. 2023)); (b) multiple LSTM layers process different logical partitions (e.g., the forward and backward phrases in a sentence (Sushmitha et al. 2022)); and (c) multiple dense layers (Heryadi et al. 2023) with different activation functions (e.g., ReLU, sigmoid, and tanh) learn complex data in a non-linear pattern.
5.5.2 The proposed deep reinforcement learning with long short-term memory approach
In Zhang et al. (2019a), DRL classifies sentiments (O.2) expressed through speech (A.4) using two modalities, namely acoustic and textual (the transcript of the acoustic feature), at the clause level. Each speech has multiple utterances, each utterance has multiple clauses, and each clause is a collection of words with a subject and a verb. The GloVe word embedding encodes the words of the textual feature (Pennington et al. 2014), and the COVAREP software extracts the acoustic feature. P2FA ensures a temporal alignment between text and audio at the word level (Yuan and Liberman 2008).
As shown in Fig. 11, the agent identifies logical partitions, specifically when the end of each clause of a speech occurs, to provide the clause structure. Then, the agent classifies the sentiment polarity of each clause into either positive or negative. The state is a vector \(s_t=(h^t_{i-1},c^t_{i-1},h^a_{i-1},c^a_{i-1},h^t_i,c^t_i,h^a_i,c^a_i)\), where the sub-states are provided by two LSTM units that store the sequential text and acoustic features, providing: (a) the current hidden state of word i, including \(h^t_i\) and \(h^a_i\) for the text and acoustic features, respectively; (b) the current memory cell of word i, including \(c^t_i\) and \(c^a_i\) for the text and acoustic features, respectively; and (c) the previous hidden states and memory cells, namely \(h^t_{i-1}\), \(c^t_{i-1}\), \(h^a_{i-1}\), and \(c^a_{i-1}\). The action is \(a_t = \{go, {\textit{end}}\}\), where \({\textit{end}}\) indicates that the current word i is the end of the clause, and go represents otherwise. Softmax selects the exploration action. The reward \(r_{t+1}(s_{t+1})\) is a function of the classification probability for the positive and negative categories, and the number of clauses and words in the utterance when the current word i is the end of an utterance; the reward is \(r_{t+1}(s_{t+1})=0\) otherwise.
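The construction of the bimodal state vector from two LSTM units can be sketched as follows; the feature dimensions and hidden size are hypothetical, and the actual model of Zhang et al. (2019a) includes additional components such as the alignment and classification modules.

```python
import torch
import torch.nn as nn

TEXT_DIM, AUDIO_DIM, HIDDEN = 300, 74, 64     # illustrative GloVe and acoustic feature sizes
text_lstm  = nn.LSTMCell(TEXT_DIM, HIDDEN)
audio_lstm = nn.LSTMCell(AUDIO_DIM, HIDDEN)

def build_state(word_vec, audio_vec, prev):
    """Concatenate previous and current hidden states / memory cells of both modalities."""
    (h_t_prev, c_t_prev), (h_a_prev, c_a_prev) = prev
    h_t, c_t = text_lstm(word_vec, (h_t_prev, c_t_prev))
    h_a, c_a = audio_lstm(audio_vec, (h_a_prev, c_a_prev))
    state = torch.cat([h_t_prev, c_t_prev, h_a_prev, c_a_prev, h_t, c_t, h_a, c_a], dim=-1)
    return state, ((h_t, c_t), (h_a, c_a))

init = ((torch.zeros(1, HIDDEN), torch.zeros(1, HIDDEN)),
        (torch.zeros(1, HIDDEN), torch.zeros(1, HIDDEN)))
state, memory = build_state(torch.randn(1, TEXT_DIM), torch.randn(1, AUDIO_DIM), init)
print(state.shape)   # torch.Size([1, 512]): the 8-part state vector s_t observed by the agent
```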
The model is trained using multimodal datasets, namely CMU-MOSI (Zadeh et al. 2016) (D.1) and CMU-MOSEI (Zadeh et al. 2018) (D.2). The proposed solution has been shown to achieve a higher classification performance (P.2), particularly accuracy and f-measure. A traditional DRL agent struggles to align the timing of speech (e.g., acoustic signals) with the corresponding text (e.g., the transcription) at the word level, especially when the speech data contains pauses, filler words, or different tones. In this approach, LSTM has proven its ability to provide detailed information to the RL agent through the input, forget, and output gates. With this updated information, the RL agent recognizes the ending of a clause and classifies the correct sentiment polarities.
5.6 Deep reinforcement learning with dependency graphs
In Wang et al. (2021), Zhao et al. (2024), and Wu et al. (2024), DRL with fully connected networks classifies sentiments (O.2) while minimizing the effect of irrelevant words using the textual feature at the aspect level (Thet et al. 2010). Each document (e.g., a review (A.2)) has a single or multiple aspects. The challenge is that the document has irrelevant words (or noise).
The agent: (a) transforms input texts into dependency graphs (Covington 2001) (see Table 7); (b) explores aspect-sentiment paths from a target aspect node to a potential sentimental region, while avoiding irrelevant regions, on the dependency graphs; and (c) differentiates the effectiveness of different aspect-sentiment paths. The agent classifies the sentiment polarity of each aspect (i.e., each node in the dependency graph) into either positive, neutral, or negative. In Zhao et al. (2024) and Wu et al. (2024), the state \(s_t\) represents the current dependency graph structure of a sentence, where nodes are the words in the sentence, and edges are the syntactic dependencies between these words (see Fig. 12). The action \(a_t\) adjusts the weights of each edge (dependency) in the graph. The agent optimizes the weights for a more accurate sentiment prediction of each aspect (e.g., food). The reward indicates the improvement in prediction accuracy evaluated at each decision epoch, where \(r_{t+1}(s_{t+1})=1\) represents a better accuracy and \(r_{t+1}(s_{t+1})=-1\) otherwise. The policy network consists of hidden layers of bi-LSTM (Zhao et al. 2024) and a graph convolutional network to initialize and process the graph, respectively.
In Wang et al. (2021), the state \(s_t=(a_t,s_{t-1})\) represents the current action \(a_t\) and the previous state \(s_{t-1}\), which is maintained by LSTM. The action \(a_t \in H = (v_1, v_2,\ldots , v_{|H|})\) selects a next hop (i.e., a word embedding) out of the possible next hops, where H represents the neighboring or surrounding nodes of the current node, to identify the most effective path. The exploration budget is limited, so irrelevant regions (or words) are skipped to explore effective paths. The path reward \(r_{t+1}(s_{t+1})\) represents the mean squared error between the actual (or estimated) and the real (or the ground truth) sentiment polarities. When the agent arrives at a node, each pair of state \(s_t\) and potential next-hop node \(a_t \in H\) is separately fed into the policy network, and the next-hop node \(a_t\) with the highest score is selected.
The policy network, as shown in Fig. 12, is a two-layer fully connected network. The length of each path is three hops, which is sufficient to capture the syntactical information, whereby a smaller number of hops cannot reach the sentiment word, and a larger number of hops aggregates too much information. At the end of each path, the agent uses a sentiment classifier, which is a single-layer network, to predict the path sentiment polarity \(p = \delta (W_p s_t + b_p)\) given the current state \(s_t\), where \(\delta (\cdot )\) is the softmax activation function, \(W_p\) are the weights, and \(b_p\) is the bias. The weights of both the policy network and the sentiment classifier are updated simultaneously using gradients calculated based on the Gumbel softmax (Jang et al. 2016) and backpropagation.
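The single-layer path sentiment classifier \(p = \delta (W_p s_t + b_p)\) can be sketched as follows; the state dimensionality is an assumption, and the linear layer simply holds the weights \(W_p\) and bias \(b_p\).

```python
import torch
import torch.nn as nn

STATE_DIM, N_POLARITIES = 128, 3      # polarities: positive, neutral, negative (dimension assumed)

classifier = nn.Linear(STATE_DIM, N_POLARITIES)   # weights W_p and bias b_p

def path_sentiment(state):
    """Softmax over the linear projection of the final path state s_t."""
    return torch.softmax(classifier(state), dim=-1)

p = path_sentiment(torch.randn(STATE_DIM))
print(p, int(torch.argmax(p)))        # probability of each polarity and the predicted class
```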
The models are trained on the TWITTER (D.6), SemEval-2014 Task 4 (D.3), SemEval-2015 Task 5 and Task 12 (D.4), SemEval-2016 Task 5 (D.5), and MAMS (D.14) (Wu et al. 2024) datasets comprised of tweets and restaurant and laptop reviews. The proposed solutions have been shown to achieve a higher classification performance (P.2), particularly accuracy and f-measure. Dependency graphs address the challenge of noisy data sources by structuring sentences into syntactic relationships. The DRL agent interacts with the graphs and focuses on meaningful connections between aspects and sentiment words. With the enhancement of weighting the edges (Zhao et al. 2024; Wu et al. 2024), the agent further improves at filtering out noise and enhancing contextual understanding.
5.7 Deep reinforcement learning with multiplex heterogeneous graph
In Dai et al. (2022), DRL extracts sentiment words (O.4), which represent the customer sentiment, for a given aspect in a sentence. The challenge is to identify correct sentiment words, which may be far away from their corresponding aspect, via the shortest possible path.
The agent transforms an input sentence X into a multiplex heterogeneous graph \(G = (V, E) = (G^{{\textit{seq}}}, G^{{\textit{syn}}})\), where V are nodes representing the words of a sentence and E are edges linking the nodes. The graph G includes: (a) a sequence subgraph \(G^{{\textit{seq}}} = (V, E^{{\textit{seq}}})\) with each edge representing the sequential relationship between two consecutive nodes; and (b) a syntactic subgraph \(G^{{\textit{syn}}} = (V, E^{{\textit{syn}}})\) with each edge representing the syntactic relationship between two nodes as shown in Fig. 13. Hence, \(E = E^{{\textit{seq}}} \cup E^{{\textit{syn}}}\). Examples of a syntactic relationship are adjectival complement (acomp), conjunct (conj), determiner (det), direct object (dobj), negation (neg), and open clausal complement (xcomp). The agent also obtains the aspect label sequence \(l^a\) and the predicted opinion label sequence \(l^O\), which is obtained at the terminal state, by labeling every node and its position using the BIO scheme, where \(l = \{B: {\textit{beginning}}, I: {\textit{inside}}, O: {\textit{other}}\}\).
The agent explores a path from the aspect to a node on the graph, and performs two tasks: (a) predicting the label of the node, i.e., whether it is a sentiment word; and (b) predicting the opinion label of the selected node. The state is a vector \(s_t = (W, G, L^a, P_t)\), where \(P_t\) is the path (comprised of the selected edges and nodes, and their predicted opinion labels) explored so far starting from an aspect. The action \(a_t\) selects a next-hop node so that the agent moves through the respective edge, and then predicts the opinion label for the next-hop node. The reward \(r_T(s_T)\) represents the reward of the entire path \(P_T\) when the agent reaches the terminal state \(s_T\). This means that the agent may have selected and completed multiple actions before it reaches the terminal state in an episode to provide the entire path. The reward \(r_T(s_T)\) has three components to generate a good path that covers all sentiment words and provides correct opinion labels for the sentiment words of the sentence: (a) the ratio of the number of correctly predicted sentiment words to the number of sentiment words in the sentence, whereby a higher ratio indicates that more sentiment words have been covered; (b) the number of words with correctly predicted opinion labels in the sentence; and (c) the ratio of the weighted numbers of sequential and syntactic edges to the number of edges in a sentence, in which the weight adjusts the preference for sequential and syntactic edges. A lower ratio indicates that the entire path is shorter with fewer nodes (and actions), hence providing a higher efficiency.
The policy network, as shown in Fig. 13, consists of: (a) an MLP (multi-layer perceptron) receives the state vector and provides the current state; (b) a bi-directional gated recurrent unit (BiGRU) combines the current and past states to provide a joint state; (c) an MLP receives the action and provides the probability distribution of all actions given a state. The action, or the current word, is padded with more information for guiding exploration, and it is a combination of: (a) a syntactic word that pads neighbor nodes in the syntactic subgraph using a graph neural network; and (b) a sequential word that pads the previous word in the sequence subgraph using a BiGRU. The agent selects one of the neighbor nodes as the next hop in the path. The weights of the policy network are updated using the Monte Carlo tree search algorithm (Shen et al. 2018). The loss represents the difference between the reward and the long-term reward (or the value function).
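The construction of the multiplex heterogeneous graph from a sentence can be sketched with NetworkX as follows; the example sentence and its dependency relations are hard-coded for illustration, whereas Dai et al. (2022) obtain them from a syntactic parser.

```python
import networkx as nx

sentence = ["the", "food", "is", "not", "good"]

G = nx.MultiDiGraph()
G.add_nodes_from(sentence)

# Sequence subgraph G_seq: edges between consecutive words.
for u, v in zip(sentence, sentence[1:]):
    G.add_edge(u, v, kind="seq")

# Syntactic subgraph G_syn: hard-coded dependencies for this example sentence.
syntactic = [("good", "food", "nsubj"), ("good", "not", "neg"), ("good", "is", "cop"),
             ("food", "the", "det")]
for head, dep, relation in syntactic:
    G.add_edge(head, dep, kind="syn", relation=relation)

# Neighbours reachable from the aspect node "food" over either edge type.
print(list(G.successors("food")))
```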
The model is trained using the Sem-Eval 2014 Task 4 (D.3), Sem-Eval 2015 Task 12 (D.4), and Sem-Eval 2016 Task 5 (D.5) datasets, which are comprised of user reviews of restaurants and laptops (A.2). The proposed solution has been shown to achieve a higher classification performance (P.2), particularly precision, recall, and f-measure. The multiplex heterogeneous graph enhances RL for sentiment analysis by integrating sequential and syntactic information. The DRL agent can dynamically explore paths that accurately identify opinion words related to specific aspects. By adjusting edge selection parameters, the model adapts to different contexts and complex sentences. Nonetheless, future research can refine the criteria for selecting an edge by adding parameters that change during training. Also, this approach can be extended to other NLP tasks, such as different languages and domains.
5.8 Deep reinforcement learning with artificial neural network
In Wang et al. (2019), DRL identifies logical partitions (O.1) while minimizing the effect of irrelevant words using the textual feature at the aspect level (Thet et al. 2010). Each document (e.g., a review) has a single aspect or multiple aspects. The challenge is that the document contains irrelevant words.
The agent identifies logical partitions comprised of aspects, and it interacts with the aspect classification module to classify the sentiment polarity of each aspect into either positive, neutral, or negative. The agent receives rewards from the aspect classification module. The state \(s_t\) is a vector of the previous hidden state \(h^t_{i-1}\), the current sentence, the aspect, and the position of the aspect. The action is \(a_t=\{{\textit{start}}, {\textit{end}}, {\textit{other}}\}\), where \({\textit{start}}\) and \({\textit{end}}\) indicate the start and end words of a logical partition, respectively, and \({\textit{other}}\) indicates otherwise. As shown in Fig. 14, the aspect classification module has: (a) a bi-directional LSTM model (Seo et al. 2016) that provides representations of the logical partitions; and (b) an MLP that classifies the sentiment of the logical partitions and generates the rewards. The reward \(r_{t+1}(s_{t+1})\) is calculated based on: (a) the probability of the estimated sentiment polarity of a sentence, in which a higher value indicates that the right logical partition for the aspect has been provided, contributing to good performance in sentiment classification; (b) the presence of an aspect in a logical partition, which helps to maintain the aspect in the logical partition; and (c) the proportion of irrelevant words removed from the sentence under consideration, which helps to remove irrelevant words.
The policy network is an ANN. The weights of the policy network are updated using policy gradient methods (see Sect. 2.2) and backpropagation. The loss represents the difference between: (a) the actual (or estimated) reward received when the logical partition is provided to the aspect classification module; and (b) the real reward (or the ground truth) received when the entire sentence is provided to the aspect classification module.
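A minimal REINFORCE-style sketch of such an update is shown below, assuming that the reward obtained when the entire sentence is provided acts as a baseline for the reward obtained with the selected logical partition. The names and the exact loss form are illustrative, not the original implementation.

```python
# Illustrative policy-gradient update: the reward for the whole sentence serves
# as a baseline, so the advantage measures how much the selected logical
# partition improves the aspect classification.
import torch

def reinforce_update(policy, optimizer, log_probs, partition_reward, sentence_reward):
    # log_probs: list of log-probabilities of the actions taken in this episode
    advantage = partition_reward - sentence_reward   # partition vs. full-sentence reward
    loss = -torch.stack(log_probs).sum() * advantage
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```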
The model is trained using the Sem-Eval 2014 Task 4 dataset (D.3), which is comprised of user reviews of restaurants and laptops. The proposed solution has been shown to achieve a higher prediction performance (P.1), with a higher similarity between the estimated and real (ground truth) sentiment polarities of aspects.
DRL in sentiment analysis often struggles with identifying aspects in noisy sentences. The enhancement of DRL with an ANN shows its ability to identify logical partitions even in complex sentences, such as “good because it’s more food but bad because dim sum is supposed to be smaller...”. However, challenges remain with: (a) non-adjacent pairs of aspect and word, such as “though the spider roll may look like challenge to eat, with soft shell crab hanging out of the roll, it is well worth...”; and (b) multiple aspects with the same polarity, which are treated as a single aspect. Future research can refine the ANN with parameters that indicate distant aspect-word pairs.
5.9 Deep reinforcement learning based on Q-value distributions
In Keerthana et al. (2021), DRL, particularly DQN, enables a chatbot (A.3) to provide responses with the right sentiments (O.3) using the textual feature at the sentence level. The challenge is to process long input sentences with a large number of tokens (e.g., up to 40 words).
The chatbot converses with users and answers their questions in natural language. The state \(s_t\) represents: (a) the semantic frames, which are lower-level representations of natural language utterances; and (b) the past four frames. Using the past four frames indicates that the decision epoch is four frames, whereby the agent selects an action every four frames to reduce computational cost and gather meaningful experience, which is essential because actions are unlikely to change in every frame. The action \(a_t\) represents the response output. The reward differentiates right and wrong response outputs, where \(r_{t+1}(s_{t+1})=1\) for a right response output and \(r_{t+1}(s_{t+1})=-1\) otherwise.
The policy network interacts with the natural language understanding (NLU) and the natural language generation (NLG) units. The policy network receives states from NLU and sends actions to NLG. Both NLU and NLG use a bidirectional RNN (BRNN) that connects outputs with two hidden layers in opposite directions to provide information related to past and future states. There are three main steps. First, the NLU unit uses the current input words, current state, and next input word to extract and represent relevant keywords in the form of semantic frames. Second, the policy network uses the semantic frames and the past four frames to create a state representation. The policy network uses an enhanced DQN approach to provide a distribution over the Q-values of an action, instead of the single mean Q-value of an action as in the traditional DQN approach, and then selects actions. Using the distribution, instead of the mean value, contributes to more stable Q-values and a higher convergence rate. The enhanced DQN approach uses quantile regression to learn and compute quantile values on fixed uniform quantile fractions, and then minimizes the quantile Huber loss. Third, the NLG receives and converts the selected actions into natural language that users can understand.
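The quantile Huber loss used in quantile-regression DQN, on which this enhancement is based, can be sketched as follows. The tensor shapes, the fixed quantile fractions, and the averaging scheme are assumptions for illustration rather than the exact implementation.

```python
# A sketch of the quantile Huber loss from quantile-regression DQN (QR-DQN).
# pred_quantiles and target_quantiles hold N quantile values per sample for
# the taken action; the averaging over quantile pairs is an assumption.
import torch
import torch.nn.functional as F

def quantile_huber_loss(pred_quantiles, target_quantiles, kappa=1.0):
    # pred_quantiles:   (batch, N)  quantile values predicted for the taken action
    # target_quantiles: (batch, N)  Bellman targets built from the next state
    n = pred_quantiles.size(1)
    # Fixed uniform quantile fractions (midpoints): 1/2N, 3/2N, ...
    taus = (torch.arange(n, dtype=torch.float32) + 0.5) / n          # (N,)
    # Pairwise TD errors u[b, i, j] = target_j - pred_i.
    u = target_quantiles.unsqueeze(1) - pred_quantiles.unsqueeze(2)  # (batch, N, N)
    huber = F.huber_loss(pred_quantiles.unsqueeze(2).expand_as(u),
                         target_quantiles.unsqueeze(1).expand_as(u),
                         reduction="none", delta=kappa)
    # Asymmetric weighting |tau_i - 1{u < 0}| from quantile regression.
    weight = torch.abs(taus.view(1, -1, 1) - (u.detach() < 0).float())
    return (weight * huber).mean()
```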
The model is trained using the Cornell movie-dialogs corpus (D.8) and the CoQA (D.9) datasets. The proposed solution has been shown to achieve a higher cumulative reward (P.5) and a higher prediction performance (P.1), particularly BLEU. The enhanced DQN approach considers the entire Q-value distribution and provides more possible actions. The use of quantile regression further refines this process by learning and estimating specific parts of the distribution, rather than focusing on a single mean value. With this advantage, the integration of Q-value distributions into RL represents a significant advancement in sentiment analysis. Nonetheless, there is still a research gap in combining Q-value distributions with other advanced RL techniques, such as actor-critic methods or policy gradients. As this study shows promising generalization across different datasets, future research can also explore multilanguage datasets.
5.10 Summary of RL approaches in sentiment analysis
Table 10 presents a summary of enhanced RL approaches across six attributes. This section comprehensively links and compares the attributes discussed in the previous sections, as illustrated in Fig. 5, covering objectives, applications, performance metrics, and datasets.
5.10.1 Quantitative summary
As discussed in Sect. 3.1, five objectives have been addressed by enhanced RL approaches. The three primary objectives are sentiment classification, response adjustment, and aspect extraction from unlabeled data. Among these, classifying sentiments receives significant attention, with 6 papers focusing on this goal (Zhang et al. 2019a; Wang et al. 2021; Wu et al. 2024; Zhao et al. 2024; Pröllochs et al. 2020). Adjusting responses and extracting aspects from unlabeled data are also critical areas of focus, each addressed by two studies.
As mentioned in Sect. 3.2, four applications represent suitable use cases for these enhanced RL approaches. Among them, electronic commerce is prominently highlighted, particularly for sentiment classification tasks (Venugopalan and Gupta 2022; Wang et al. 2021; Wu et al. 2024; Zhao et al. 2024; Dai et al. 2022). Other important application areas include social networks and chatbots, which leverage RL algorithms to enhance user interactions and experience. Speech processing is also a noted application, demonstrating the adaptability of RL approaches in handling speech-to-text conversion.
As highlighted in Sect. 3.3, there are six categories of performance metrics. Most studies utilize classification performance metrics and have been shown to achieve positive results. Prediction performance is another important metric for evaluating models. Four studies (Keerthana et al. 2021; Pröllochs et al. 2020; Wang et al. 2019; Li et al. 2018) successfully enhanced prediction performance with their RL algorithms.
As detailed in Sect. 4.1, there are 14 datasets used in the studies. SemEval-2014 Task 4 appears to be the most popular dataset for sentiment classification, followed by its upgraded versions, SemEval-2015 Task 12 and SemEval-2016 Task 5. Data from Twitter is also widely used for extracting sentiment-related words and classifying sentiments accordingly.
5.10.2 Qualitative summary
Table 10 highlights various opportunities for future advancements of RL and DRL in sentiment analysis. According to Wang et al. (2019), DRL with ANN struggles with handling non-adjacent pairs of aspects and sentiment words, and it relies on organized data, such as straightforward e-commerce reviews. However, ACRL (Venugopalan and Gupta 2022; Li et al. 2018) shows the ability to extract aspects from unlabeled data, thereby extending the DRL with ANN approach in identifying logical partitions. Consequently, ACRL can be further explored for classifying the sentiment polarities of distant aspect-word pairs, such as those in the unorganized data found in social networks.
Additionally, further investigation can be performed to understand the effect of RL with episodic rewards on restaurant reviews. For instance, DRL (Yang et al. 2019) struggles in identifying negated or latent opinions in restaurant reviews. Given the advancements of RL with episodic rewards (Pröllochs et al. 2020) in classifying the sentiment polarities of negated sentences, further investigation can examine how this approach performs with restaurant reviews that contain latent meaning. For example, a sentence successfully processed by RL with episodic rewards (e.g., “This fancy product isn’t good but fantastic.”) could be replaced with the error-prone sentence in the DRL approach (e.g., “I can understand the prices if it served better food.”).
Moreover, further investigation can explore the application of DRL with dependency graphs in chatbots. First, all related approaches (Wang et al. 2021; Zhao et al. 2024; Wu et al. 2024) show significant improvements in classifying sentiment polarities, especially in latent contexts, for e-commerce reviews and social network comments. However, there is limited experimentation in chatbot applications. In contrast, RL with baseline models as the reward functions (Lee et al. 2018) has shown significant improvement in adjusting responses in a chatbot, but this approach faces a challenge in preserving semantics and generating quality responses due to latent contexts. Thus, DRL with dependency graphs shows potential for extending RL with baseline models in chatbot applications.
Additionally, further investigation can be performed for DRL with Q-value distributions in multimodal contexts (e.g., speech-to-text). DRL with Q-value distributions (Keerthana et al. 2021) learns from actual conversations between movie characters and shows significant improvement in adjusting chatbot responses. Meanwhile, DRL with LSTM (Zhang et al. 2019a) explores the benefits of a speech-to-text application (e.g., the end of a smile determines the relevant sentiment word). The former approach provides a strong foundation for realistic dialogues, while the latter improves the accuracy of understanding long spoken contexts. The combination of these approaches can potentially advance a real-time, multimodal conversational application.
6 Open issues
This section outlines the current research gaps in RL and sentiment analysis, highlighting directions for future research and investigation. Additionally, this section proposes potential models and solutions, prioritizing the challenges identified in current studies.
6.1 Addressing the challenges of extracting logical partitions
Some logical partitions have been reported to be difficult to extract. First, when there are separate descriptions of an aspect that are adjacent to each other, each description tends to be regarded as another aspect (Wang et al. 2019). Second, when there are separate aspects in a short sentence, the entire sentence tends to be regarded as a single aspect (Wang et al. 2019). There are two main challenges, namely: (a) the semantic understanding of the text; and (b) the granular analysis of the sentence structure and meaning. Improving the extraction of logical partitions increases the accuracy of organizing and summarizing key information in a sentence, enhancing the ability of RL agents to extract knowledge from textual datasets efficiently. To overcome this issue, advanced NLP techniques, such as BERT (Devlin et al. 2018), can incorporate co-reference resolution to link words to related descriptions, capturing better semantic meaning and the relationships among aspects. The proposed solution enables more robust logical partitioning, resulting in better comprehension and utilization of textual data by RL agents, potentially improving the F1 score.
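As a simple illustration of more robust logical partitioning, the sketch below splits a review sentence into clause-level segments around coordinating conjunctions using spaCy's dependency parse. This heuristic is an assumption for illustration only, not the partitioning strategy of any surveyed paper.

```python
# Illustrative pre-processing: split a review sentence into clause-level
# logical partitions at coordinating conjunctions found by the dependency parse.
import spacy

nlp = spacy.load("en_core_web_sm")

def clause_partitions(sentence: str):
    doc = nlp(sentence)
    boundaries = [tok.i for tok in doc if tok.dep_ == "cc"]  # e.g., "but", "and"
    partitions, start = [], 0
    for b in boundaries:
        partitions.append(doc[start:b].text)
        start = b + 1  # skip the conjunction itself
    partitions.append(doc[start:].text)
    return [p for p in partitions if p.strip()]

print(clause_partitions("The battery life is great but the screen is too dim"))
# typically -> ['The battery life is great', 'the screen is too dim']
```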
6.2 Addressing the challenges of extracting correct sentiment words for aspects
Some sentiment words have been reported to be difficult to extract and relate to the right aspects. First, latent opinions requiring deep understanding tend to be incorrectly classified. For instance, in “I can understand the price if it served better food”, the word “better” carries significant weight, so the “price” aspect is incorrectly classified with a positive sentiment polarity although it should be negative (Yang et al. 2019). Second, the sentiment polarity of an aspect tends to be affected by other sentiment words in the same sentence when the real sentiment polarity of the aspect is neutral. For instance, in “A beautiful atmosphere, perfect for drinks”, the “drinks” aspect is incorrectly classified with a positive sentiment (Yang et al. 2019). Inaccurate sentiment word extraction leads to incorrect sentiment classification, reducing the reliability of aspect-based sentiment analysis. One way to resolve this issue is to develop an attention mechanism (Vaswani et al. 2017) that focuses on the relevant sentiment word for each aspect. Another way is to use the Monte Carlo tree search algorithm (Shen et al. 2018) to extract aspects and their associated sentiment words. The algorithm treats each sentence as a tree where aspects and sentiment words are incrementally added as nodes. Random walks down the tree allow exploration to find coherent aspect and sentiment word combinations. The reward function measures the extraction accuracy. This method explores different possible aspect and sentiment word combinations to find an optimal extraction specific to each sentence, enabling more accurate sentiment analysis and an increased accuracy in sentiment classification.
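The attention-based direction can be sketched as an aspect-conditioned attention layer, in which the aspect embedding queries the word embeddings so that the sentiment words most relevant to that aspect receive the highest weights. The dimensions and the scoring scheme below are illustrative assumptions.

```python
# A hedged sketch of aspect-conditioned attention: the aspect embedding acts
# as the query over the sentence's word embeddings.
import math
import torch
import torch.nn as nn

class AspectAttention(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, aspect_emb, word_embs):
        # aspect_emb: (dim,)  word_embs: (seq_len, dim)
        q = self.q(aspect_emb)                       # (dim,)
        k = self.k(word_embs)                        # (seq_len, dim)
        v = self.v(word_embs)                        # (seq_len, dim)
        scores = k @ q / math.sqrt(q.size(-1))       # (seq_len,)
        weights = torch.softmax(scores, dim=0)       # attention over words
        context = weights @ v                        # aspect-specific sentiment context
        return context, weights

attn = AspectAttention()
context, weights = attn(torch.randn(64), torch.randn(10, 64))
print(weights.argmax().item())  # index of the word most attended for this aspect
```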
6.3 Exploring more data types for training RL
Most RL-based dialogue systems utilize limited state representations, such as the current user utterance, previous system response, and predicted sentiment polarities. However, supplementary user data types, including user profiles, beyond-session histories (e.g., customer service level agreement), and user request types remain largely unexplored. Incorporating such enriched inputs enables RL agents to construct better user state representations. This opens up possibilities for dialogue systems to achieve personalization, optimize priority determination, and reduce resolution time. There are two main challenges of utilizing additional data types, namely: (a) the increased complexity of the policy; and (b) the increased cost of collecting quality data for incorporating additional dimensions. To resolve this issue: (a) graph neural networks can represent the relationships between different data types in a lower-dimensional latent space for RL; and (b) generative adversarial networks (GAN) (Li et al. 2018) can simulate synthetic samples for training when real data is lacking. The proposed solutions are expected to improve performances, including higher human satisfaction with the system response time and relevancy.
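A minimal sketch of such an enriched state representation is shown below; the field names, encodings, and dimensions are hypothetical and only illustrate how supplementary data types could be concatenated with the usual utterance features.

```python
# Hypothetical enriched dialogue state combining an utterance embedding,
# predicted sentiment, user profile attributes, and the request type.
import numpy as np

REQUEST_TYPES = ["billing", "technical", "general"]

def build_state(utterance_emb: np.ndarray, sentiment: float,
                user_tier: int, past_tickets: int, request_type: str) -> np.ndarray:
    request_onehot = np.eye(len(REQUEST_TYPES))[REQUEST_TYPES.index(request_type)]
    profile = np.array([user_tier, np.log1p(past_tickets)])  # beyond-session history
    return np.concatenate([utterance_emb, [sentiment], profile, request_onehot])

state = build_state(np.random.rand(32), sentiment=-0.4,
                    user_tier=2, past_tickets=5, request_type="technical")
print(state.shape)  # (38,)
```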
6.4 Improving user satisfaction with multimodal datasets
RL-based sentiment analysis can be explored to improve user satisfaction. User feedback in the form of various modalities, such as acoustic and textual features, can be integrated into RL models for various applications. RL-based sentiment analysis has been applied to provide the right state reflecting user sentiment, such as when providing labels (e.g., inquiry, suggestion, complaint, request for help, and gratitude) categorizing petitions submitted by the public (Li et al. 2023). RL-based sentiment analysis has also been applied to provide the right reward values reflecting user sentiment, such as determining: (a) the credit card being offered to the user in a credit card recommendation system (Jain and Fallon 2023); (b) the courses being recommended to the user in a course recommendation system (Vedavathi and Bharadwaj 2022); and (c) the joint values of the knee, hip, and ankle of the agent robot for achieving balance in human-robot cooperative tasks (Jeon et al. 2023). The challenge of RL is to customize responses, which can range from a simple textual response to recommended personalized and value-added products, for each user to meet their preferences, such as price and future needs. Due to the competitiveness of the digital marketplace, RL can use sentiment analysis to meet diverse and dynamic user demands. Apart from measurable factors, such as price and delivery time, non-measurable factors, including user relationship and value proposition, are taken into consideration in sentiment analysis. According to Lemmens and Gupta (2020), poor customer relationships and services account for 70% of customer churn. One solution is to combine sentiment modeling with dynamic reward shaping that adjusts the algorithm parameters according to user sentiments. This helps to improve prediction-related performance metrics, such as a higher BLEU.
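A minimal sketch of sentiment-driven dynamic reward shaping is shown below; the thresholds, coefficients, and adaptation rule are assumptions for illustration, not a method from any surveyed paper.

```python
# Illustrative dynamic reward shaping: the task reward is adjusted by the
# user's detected sentiment, and the shaping weight adapts when sentiment
# stays negative over several turns (thresholds are assumptions).
class SentimentRewardShaper:
    def __init__(self, alpha=0.3):
        self.alpha = alpha           # weight of the sentiment term
        self.negative_streak = 0     # consecutive turns of negative sentiment

    def shape(self, task_reward: float, sentiment_score: float) -> float:
        # sentiment_score in [-1, 1] from any upstream sentiment classifier
        self.negative_streak = self.negative_streak + 1 if sentiment_score < -0.5 else 0
        if self.negative_streak >= 3:            # user unhappy for several turns
            self.alpha = min(1.0, self.alpha * 1.1)
        return task_reward + self.alpha * sentiment_score

shaper = SentimentRewardShaper()
print(shaper.shape(task_reward=1.0, sentiment_score=-0.8))
```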
6.5 Improving generalization across different domains
Poor generalization tends to occur when there is an insufficient amount of manually labeled data (Wang et al. 2021). This often leads to overfitting, where the model performs well on the training data but fails to generalize to unseen data. Improving generalization involves tackling challenges related to data diversity, domain adaptation, and data variability. One model may need to be trained with multiple data sources or domains, allowing it to learn diverse linguistic patterns and sentiments for better adaptability. Improving generalization is significant as it enhances the robustness of sentiment analysis and makes it applicable across different domains. Techniques such as adversarial learning (Li et al. 2018) or fine-tuning (Liu et al. 2023) enable models to perform well in scenarios they are not initially trained for. Moreover, expanding training data to encompass a wider range of sentiments and linguistic styles enhances a model’s adaptability to different emotional expressions, ultimately improving its generalization capabilities. Other techniques, such as transfer learning (Wang et al. 2022), should be further studied to create a cross-lingual mapping. For example, in Zhang et al. (2019b), transfer learning is used to change the sentiment from positive to negative, and to transfer from book reviews to movie reviews. The altered text reflects the new sentiment while retaining the aspects from the original domain. With this approach, the model can be used in a broader context while retaining the important details of each aspect. Addressing these challenges ensures that the trained model can be used on broader datasets without biases and prediction errors, enhancing its ability to handle diverse real-world scenarios.
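A hedged sketch of fine-tuning for domain adaptation is shown below: a sentiment classifier's encoder, pre-trained on the source domain, is frozen while only a lightweight classification head is updated on a small target-domain set. The components and hyperparameters are generic assumptions rather than a specific surveyed architecture.

```python
# Generic fine-tuning sketch for cross-domain sentiment classification:
# freeze the pre-trained encoder and update only the classification head.
import torch
import torch.nn as nn

def fine_tune_head(encoder: nn.Module, head: nn.Module, target_loader, epochs=3):
    for p in encoder.parameters():
        p.requires_grad = False                      # keep source-domain knowledge
    optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in target_loader:                   # small labeled target-domain batches
            logits = head(encoder(x))
            loss = loss_fn(logits, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return head
```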
6.6 Addressing the challenges of understanding sentiment in multiple languages
Understanding multilanguage sentiment refers to the ability of a model to analyze sentiment polarities across multiple languages. It involves methods to identify tones, idioms, language structures, and synonyms from sentences in one or more languages. Understanding sentiment across multilingual environments is challenging due to: (a) the influence of different cultural contexts and histories on sentiment analysis, as emotional expressions can vary across languages; (b) dialects or mixed languages in a single sentence, which can affect the nuance of the sentence; (c) the language structure and number of characters, which vary across languages (Park et al. 2020); and (d) limited datasets, such as insufficient labeled lexicons for uncommon languages (Agüero-Torales et al. 2021) or word images that were unavailable before the study in Park et al. (2020). Successfully addressing multilingual sentiment analysis holds great promise in enabling seamless sentiment comprehension across global communication platforms, fostering better cross-cultural insights, and enhancing the capabilities of sentiment-aware applications in multilingual settings. In Deriu et al. (2017), joint training is introduced in which a CNN is trained on four languages at once. The proposed approach shows a better performance in a mixed-language environment. However, under a monolingual environment, the proposed solution shows a 2.45% lower classification performance in f-measure compared to training with a single language. The proposed solutions are expected to improve the accuracy of sentiment classification across diverse languages, such as a higher f-measure.
6.7 Leveraging large language models to identify trends
With a large amount of training data, identifying trends across different domains has shown promise. However, the key challenge is training models that learn substantial trends, instead of one-off events, from historical data. There is also difficulty in scaling up high-quality datasets for evaluating systems and identifying trends. Nonetheless, models like GPT-4 have made good progress in this area, revealing evolving sentiment patterns across various domains, including business, brands, public health, and world events, before they are visible to the public (Gallifant et al. 2024). This can provide early warnings and valuable insights to experts in these domains. Some solutions that may help overcome the challenges include combining both labeled and unlabeled sentiment data, analyzing relationships between events and sentiment swings, and creating better tests to assess sentiment. Harnessing the vast knowledge of these models can significantly assist professionals across many domains. The key performance metrics for progress are the precision of detecting meaningful trends, computational speed, scalability, and the lead time before the real-world trend emerges.
6.8 Exploring the applications of RL-based sentiment analysis
RL-based sentiment analysis can be further extended to a broader range of professional applications. There are domain-specific linguistic expressions and sentiment nuances in different domains, each requiring specialized adaptation of RL models. The main challenge is to tailor an RL model for each domain. Cross-domain adaptation is challenging due to domain-specific lexicons, contexts, and sentiment variations towards the same subject. Addressing this challenge is useful for leveraging RL in sentiment analysis across different domains. One solution is to use transfer learning techniques in RL fine-tuning, which can improve accuracy in sentiment analysis across different domains without retraining the models from scratch. Expected improvements in performance metrics include higher RL performance, such as the cumulative reward, for inducing positive human sentiment.
7 Conclusion
This paper presents a review of the application of reinforcement learning (RL) for sentiment analysis. Sentiment analysis has been shown to succeed in understanding user sentiments and preferences. RL and DRL extend its application by reducing the need for supervised learning, reducing the amount of training data, enhancing scalability, and preserving the context and order of logical partitions for sentiment analysis. Various aspects of RL in sentiment analysis are presented in this paper, including the objectives, applications, performance metrics, datasets, and pre-processing. Most importantly, this paper presents a review of multiple RL and DRL approaches in sentiment analysis (e.g., RL with baseline models, RL with episodic rewards, and DRL with LSTM, ANN, ACRL, and Q-value distributions) and their enhancements (e.g., DRL with dependency graphs and DRL with multiplex heterogeneous graphs). RL and DRL in sentiment analysis have been applied in various fields, such as social networks, e-commerce websites, chatbots, speech, and machine translation. The forms of state, action, and reward for each application are thoroughly discussed with each algorithm. RL in sentiment analysis has been shown to achieve higher performance based on human evaluation and baseline models, as well as prediction and classification performances. DRL in sentiment analysis has been shown to achieve a higher cumulative reward, and higher classification and prediction performances. Lastly, this paper presents open issues for further research, revolving around logical partitions, contexts, data types, user satisfaction, generalization, multilingual understanding, and trends.
Data availability
There are no data available for this paper.
References
Aftab MO, Ahmad U, Khalid S et al (2021) Sentiment analysis of customer for ecommerce by applying AI. In: 2021 International conference on innovative computing (ICIC). IEEE, pp 1–7
Agüero-Torales MM, Salas JIA, López-Herrera AG (2021) Deep learning and multilingual sentiment analysis on social media data: an overview. Appl Soft Comput 107:107373
Bergstra J, Yamins D, Cox D (2013) Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In: International conference on machine learning. PMLR, pp 115–123
Bisane A, Chandravanshi S, Thakre P et al (2023) A comprehensive product review system for improved customer satisfaction. In: 2023 2nd International conference on paradigm shifts in communications embedded systems, machine learning and signal processing (PCEMS). IEEE, pp 1–6
Borg A, Boldt M (2020) Using VADER sentiment and SVM for predicting customer response sentiment. Expert Syst Appl 162:113746
Chandra RV, Varanasi BS (2015) Python requests essentials. Packt Publishing Birmingham, UK
Chawla NV, Bowyer KW, Hall LO et al (2002) Smote: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
Covington MA (2001) A fundamental algorithm for dependency parsing. In: Proceedings of the 39th annual ACM southeast conference, Athens, GA
Dai Y, Wang P, Zhu X (2022) Reasoning over multiplex heterogeneous graph for target-oriented opinion words extraction. Knowl Based Syst 236:107723
Degottex G, Kane J, Drugman T et al (2014) Covarep-a collaborative voice analysis repository for speech technologies. In: 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 960–964
Deriu J, Lucchi A, De Luca V et al (2017) Leveraging large amounts of weakly supervised data for multi-language sentiment classification. In: Proceedings of the 26th international conference on world wide web, pp 1045–1052
Devlin J, Chang MW, Lee K et al (2018) BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
Dong L, Wei F, Tan C et al (2014) Adaptive recursive neural network for target-dependent twitter sentiment classification. In: Proceedings of the 52nd annual meeting of the association for computational linguistics (volume 2: Short papers), pp 49–54
Drugman T, Alwan A (2019) Joint robust voicing detection and pitch estimation based on residual harmonics. arXiv preprint arXiv:2001.00459
Drugman T, Thomas M, Gudnason J et al (2011) Detection of glottal closure instants from speech signals: a quantitative review. IEEE Trans Audio Speech Lang Process 20(3):994–1006
Elreedy D, Atiya AF (2019) A comprehensive analysis of synthetic minority oversampling technique (smote) for handling class imbalance. Inf Sci 505:32–64
Eshak MI, Ahmad R, Sarlan A (2017) A preliminary study on hybrid sentiment model for customer purchase intention analysis in socialcommerce. In: 2017 IEEE conference on big data and analytics (ICBDA). IEEE, pp 61–66
Gallifant J, Fiske A, Levites Strekalova YA et al (2024) Peer review of gpt-4 technical report and systems card. PLOS Digital Health 3(1):e0000417
Hardeniya N, Perkins J, Chopra D et al (2016) Natural language processing: python and NLTK. Packt Publishing Ltd
Heryadi Y, Wijanarko BD, Murad DF et al (2023) Restaurant customer feedback sentiment analysis using aspect embedding long short-term memory model. In: 2023 International conference on computer science, information technology and engineering (ICCoSITE). IEEE, pp 106–110
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Honnibal M, Montani I, Van Landeghem S et al (2020) Industrial-strength natural language processing in python. spaCy
Jain S, Fallon E (2023) Leveraging unstructured data to improve customer engagement and revenue in financial institutions: a deep reinforcement learning approach to personalized transaction recommendations. In: 2023 International conference on computer, information and telecommunication systems (CITS). IEEE, pp 01–08
Jang E, Gu S, Poole B (2016) Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144
Jeon H, Kim DW, Kang BY (2023) Deep reinforcement learning for cooperative robots based on adaptive sentiment feedback. Available at SSRN 4471793
Jiang Q, Chen L, Xu R et al (2019) A challenge dataset and effective models for aspect-based sentiment analysis. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pp 6280–6285
Kalashami MP, Pedram MM, Sadr H (2022) EEG feature extraction and data augmentation in emotion recognition. Comput Intell Neurosci 1:7028517
Kane J, Gobl C (2013) Wavelet maxima dispersion for breathy to tense voice discrimination. IEEE Trans Audio Speech Lang Process 21(6):1170–1179
Keerthana RR, Fathima G, Florence L (2021) Evaluating the performance of various deep reinforcement learning algorithms for a conversational chatbot. In: 2021 2nd International conference for emerging technology (INCET). IEEE, pp 1–8
Kumar A (2023) A machine learning-based automated approach for mining customer opinion. In: 2023 4th International conference on electronics and sustainable communication systems (ICESC). IEEE, pp 806–811
Kusner M, Sun Y, Kolkin N et al (2015) From word embeddings to document distances. In: International conference on machine learning. PMLR, pp 957–966
Lee CW, Wang YS, Hsu TY et al (2018) Scalable sentiment for sequence-to-sequence chatbot response with performance analysis. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 6164–6168
Lemmens A, Gupta S (2020) Managing churn to maximize profits. Mark Sci 39(5):956–973
Li Y, Pan Q, Wang S et al (2018) A generative model for category text generation. Inf Sci 450:301–315
Li Y, Fang W, Sun H et al (2023) Pecidrl: Petition expectation correction and identification based on deep reinforcement learning. Inf Process Manag 60(3):103285
Liu Z, Xu Y, Ji X et al (2023) Twins: a fine-tuning framework for improved transferability of adversarial robustness and generalization. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 16436–16446
Loria S et al (2018) textblob documentation. Release 015 2(8):269
Luong MT, Pham H, Manning CD (2015) Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025
Mikolov T, Karafiát M, Burget L et al (2010) Recurrent neural network based language model. In: Interspeech, Makuhari, pp 1045–1048
Mnih V, Kavukcuoglu K, Silver D et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533
Moschitti A (2006) Making tree kernels practical for natural language learning. In: 11th conference of the European chapter of the association for computational linguistics, pp 113–120
Nagamanjula R, Pethalakshmi A (2018) A machine learning based sentiment analysis by selecting features for predicting customer reviews. In: 2018 Second international conference on intelligent computing and control systems (ICICCS). IEEE, pp 1837–1843
Nakov P, Zesch T (2016) Computational semantic analysis of language: Semeval-2014 and beyond. Lang Resour Eval 50:1–4
Nguyen A, Clune J, Bengio Y et al (2017) Plug & play generative networks: conditional iterative generation of images in latent space. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4467–4477
Obiedat R, Qaddoura R, Ala’M AZ et al (2022) Sentiment analysis of customers’ reviews using a hybrid evolutionary SVM-based approach in an imbalanced data distribution. IEEE Access 10:22260–22273
Page MJ, McKenzie JE, Bossuyt PM et al (2021) The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. bmj 372
Pal S, Gupta Y, Shukla A et al (2020) Activethief: model extraction using active learning and unannotated public data. In: Proceedings of the AAAI conference on artificial intelligence, pp 865–872
Park J, Lee E, Kim Y et al (2020) Multi-lingual optical character recognition system using the reinforcement learning of character segmenter. IEEE Access 8:174437–174448
Pennington J, Socher R, Manning CD (2014) Glove: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1532–1543
Pontiki M, Galanis D, Pavlopoulos J et al (2014) SemEval-2014 task 4: aspect based sentiment analysis. In: Nakov P, Zesch T (eds) Proceedings of the 8th international workshop on semantic evaluation (SemEval 2014). Association for Computational Linguistics, Dublin, Ireland, pp 27–35. https://doi.org/10.3115/v1/S14-2004, https://aclanthology.org/S14-2004
Pontiki M, Galanis D, Papageorgiou H et al (2015) SemEval-2015 task 12: Aspect based sentiment analysis. In: Nakov P, Zesch T, Cer D et al (eds) Proceedings of the 9th international workshop on semantic evaluation (SemEval 2015). Association for Computational Linguistics, Denver, Colorado, pp 486–495. https://doi.org/10.18653/v1/S15-2082, https://aclanthology.org/S15-2082
Pontiki M, Galanis D, Papageorgiou H et al (2016) Semeval-2016 task 5: aspect based sentiment analysis. In: International workshop on semantic evaluation, pp 19–30
Pröllochs N, Feuerriegel S, Lutz B et al (2020) Negation scope detection for sentiment analysis: a reinforcement learning framework for replicating human interpretations. Inf Sci 536:205–221
Psathas G (1969) The general inquirer: useful or not? Comput Humanit 3(3):163–174
Sadr H, Nazari Soleimandarabi M (2022) Acnn-tl: attention-based convolutional neural network coupling with transfer learning and contextualized word representation for enhancing the performance of sentiment classification. J Supercomput 78(7):10149–10175
Seo M, Kembhavi A, Farhadi A et al (2016) Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603
Shen Y, Chen J, Huang PS et al (2018) M-walk: learning to walk over graphs using Monte Carlo tree search. Adv Neural Inf Process Syst 31
Singh VK, Singh P, Karmakar M et al (2021) The journal coverage of web of science, scopus and dimensions: a comparative analysis. Scientometrics 126:5113–5142
Sushmitha M, Suresh K, Vandana K (2022) To predict customer sentimental behavior by using enhanced bi-LSTM technique. In: 2022 7th International conference on communication and electronics Systems (ICCES). IEEE, pp 969–975
Sutton RS, McAllester D, Singh S et al (1999) Policy gradient methods for reinforcement learning with function approximation. Adv Neural Inf Process Syst 12
Thet TT, Na JC, Khoo CS (2010) Aspect-based sentiment analysis of movie reviews on discussion boards. J Inf Sci 36(6):823–848
Uzun E, Yerlikaya T, Kirat O (2018) Comparison of python libraries used for web data extraction. J Tech Univ Sofia Plovdiv Branch, Bulgaria 24:87–92
Vaswani A, Shazeer N, Parmar N et al (2017) Attention is all you need. Adv Neural Inf Process Syst 30
Vedavathi N, Bharadwaj RS (2022) Deep flamingo search and reinforcement learning based recommendation system for e-learning platform using social media. Proc Comput Sci 215:192–201
Venugopalan M, Gupta D (2022) A reinforced active learning approach for optimal sampling in aspect term extraction for sentiment analysis. Expert Syst Appl 209:118228
Vinyals O, Le Q (2015) A neural conversational model. arXiv preprint arXiv:1506.05869
Wang T, Zhou J, Hu QV et al (2019) Aspect-level sentiment classification with reinforcement learning. In: 2019 International joint conference on neural networks (IJCNN). IEEE, pp 1–8
Wang L, Zong B, Liu Y et al (2021) Aspect-based sentiment classification via reinforcement learning. In: 2021 IEEE International conference on data mining (ICDM), IEEE, pp 1391–1396
Wang Z, Zhao Y, Wu L et al (2022) Cross-language transfer learning-based Lhasa-Tibetan speech recognition. Comput Mater Continua 73(1):629–639
Williams RJ (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach Learn 8:229–256
Wu H, Zhou D, Sun C et al (2024) LSOIT: Lexicon and syntax enhanced opinion induction tree for aspect-based sentiment analysis. Expert Syst Appl 235:121137
Yang M, Jiang Q, Shen Y et al (2019) Hierarchical human-like strategy for aspect-level sentiment classification with sentiment linguistic knowledge and reinforcement learning. Neural Netw 117:240–248
Yuan J, Liberman M et al (2008) Speaker identification on the SCOTUS corpus. J Acoust Soc Am 123(5):3878
Zadeh A, Zellers R, Pincus E et al (2016) MOSI: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv preprint arXiv:1606.06259
Zadeh AB, Liang PP, Poria S et al (2018) Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In: Proceedings of the 56th annual meeting of the association for computational linguistics (volume 1: long papers), pp 2236–2246
Zhang C, Sedoc J, D’Haro LF et al (2021) Automatic evaluation and moderation of open-domain dialogue systems. arXiv preprint arXiv:2111.02110
Zhang D, Li S, Zhu Q et al (2019a) Modeling the clause-level structure to multimodal sentiment analysis via reinforcement learning. In: 2019 IEEE international conference on multimedia and expo (ICME). IEEE, pp 730–735
Zhang R, Wang Z, Yin K et al (2019b) Emotional text generation based on cross-domain sentiment transfer. IEEE Access 7:100081–100089
Zhao X, Peng H, Dai Q et al (2024) RDGCN: Reinforced dependency graph convolutional network for aspect-based sentiment analysis. In: Proceedings of the 17th ACM international conference on web search and data mining, pp 976–984
Funding
This work was supported by the Ministry of Higher Education (MOHE), Malaysia through the Fundamental Research Grant Scheme (FRGS/1/2022/ICT02/UTAR/01/2).
Author information
Contributions
All authors contributed equally and reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest.
Ethical approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
All the authors of this research have demonstrated their participation voluntarily.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Cite this article
Eyu, J.M., Yau, KL.A., Liu, L. et al. Reinforcement learning in sentiment analysis: a review and future directions. Artif Intell Rev 58, 6 (2025). https://doi.org/10.1007/s10462-024-10967-0