
Are Your Comments Positive? A Self-Distillation Contrastive Learning Method for Analyzing Online Public Opinion

1 School of Business and Management, Jilin University, Jilin 130015, China
2 School of Artificial Intelligence, Jilin University, Jilin 130015, China
3 School of Engineering, University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal
4 College of Computer Science and Technology, Jilin University, Jilin 130015, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(13), 2509; https://doi.org/10.3390/electronics13132509
Submission received: 17 May 2024 / Revised: 20 June 2024 / Accepted: 23 June 2024 / Published: 26 June 2024

Abstract

With the popularity of social media, online opinion analysis is being applied ever more widely and deeply in management studies. Automatically recognizing the sentiment of user reviews is a crucial tool for opinion analysis research. However, previous studies have mainly focused on specific scenarios or algorithms that cannot be directly applied to real-world opinion analysis. To address this issue, we collect a new dataset of user reviews from multiple real-world scenarios such as e-retail, e-commerce, movie reviews, and social media. Due to the heterogeneity and complexity of this multi-scenario review data, we propose a self-distillation contrastive learning method. Specifically, we utilize two EMA (exponential moving average) models to generate soft labels as additional supervision. Additionally, we introduce a prototypical supervised contrastive learning module that reduces the variability of data across scenarios by pulling together representations of the same class. Our method proves highly competitive, outperforming other advanced methods: it achieves an 87.44% F1 score, exceeding the performance of current advanced methods by 1.07%. Experimental results, including examples and visualization analysis, further demonstrate the superiority of our method.

1. Introduction

Public opinion analysis involves analyzing, monitoring, and researching the attitudes, perceptions, and evaluations of the public or a specific group toward an event, individual, or organization. It includes collecting, organizing, analyzing, and interpreting public opinion, with the purpose of revealing trends in public opinions and attitudes that can inform the decisions of relevant stakeholders. Identifying sentiment polarity in public comments is a crucial step toward conducting opinion analysis. With the progress of artificial intelligence, researchers have attempted to use computational techniques to interpret sentiment in public comments. In earlier studies, Refs. [1,2,3,4] used emotion lexicons to identify the sentiment of a sentence. With the development of machine learning, Refs. [5,6] employed machine learning methods to identify the sentiment state of a sentence, bringing sentiment analysis to prominence across the field of NLP (natural language processing). In recent years, deep learning methods [7,8,9,10] have shown remarkable performance in sentiment analysis. In summary, a significant body of work has achieved breakthroughs in the field of sentiment analysis.
However, many studies in sentiment analysis focus on the algorithmic level and train their models using data from a single scenario, such as movie or takeout reviews. As a result, these models cannot accurately judge the sentiment polarity of reviews in real-world situations, which poses a great challenge for further public opinion analysis. To fill this research gap, we compile a new dataset that includes reviews from various sources such as e-commerce, social media, and movies. This dataset can support subsequent public opinion research.
Because this dataset is composed of multiple real-world scenarios, merging data from these different scenarios raises two issues: (1) data heterogeneity and (2) data diversity. As shown in Figure 1, the dataset covers many application scenarios, such as e-commerce, social platforms, and shopping. The data in different scenarios vary greatly, mainly in tone and word choice, so their heterogeneity may pose difficulties for model training. In addition, the multi-scenario data increase the complexity and diversity of the entire dataset. As shown in Figure 1, some examples exhibit duality, containing both positive and negative emotional statements, which can confuse the model.
To address these issues, we propose a novel self-distillation framework. Specifically, we utilize two EMA (exponential moving average) models to generate soft labels as additional supervision, thereby alleviating model confusion. In addition, we introduce a prototypical supervised contrastive learning module, which reduces the variability of data across scenarios by pulling together representations of the same class, thus mitigating the data heterogeneity caused by multiple scenarios.
The contributions of this paper are as follows:
  • We have constructed a multi-scenario sentiment analysis dataset to facilitate subsequent research on public opinion analysis.
  • We propose a simple self-distillation framework to alleviate model confusion, enhancing the training and generalization capabilities of the model.
  • We introduce the prototypical supervised contrastive learning module to mitigate the heterogeneity introduced by multiple scenarios in the dataset.
The organization of this paper is as follows:
  • Section 2 reviews the related work, providing a comprehensive overview of the existing literature and identifying the research gaps that this study aims to address.
  • Section 3 explains the process of constructing the public opinion dataset.
  • Section 4 outlines the proposed framework and methodology, detailing the theoretical foundations, design considerations, and implementation specifics.
  • Section 5 presents the experimental results of our methodology, including performance evaluations, comparative analyses with baseline methods, and further in-depth analyses.
  • Section 6 concludes the paper, summarizing the key findings, discussing the implications of the results, and suggesting directions for future research.

2. Related Work

2.1. Sentiment Analysis

Sentiment analysis is an important and highly studied research task in natural language processing. Refs. [1,2,3,11,12] attempted to determine the sentiment polarity of sentences by using sentiment dictionaries or hand-crafted rules. These methods rest on a priori human knowledge and require no additional data or model training; however, they are limited to their target scenarios and lack generalizability and robustness. With the popularity of machine learning, many researchers turned to machine learning methods for sentiment analysis. These methods rely on domain experts to construct relevant features and use classifiers to make judgments, requiring only a small amount of labeled training data. Ref. [5] employed various machine learning methods such as Naive Bayes, maximum entropy classification, and support vector machines for sentiment analysis, with promising performance. Ref. [13] applied an improved Naive Bayes method and achieved exceptional performance on a restaurant review dataset. Despite the achievements of machine learning in sentiment analysis, feature engineering requires domain experts, consumes significant human resources, and suffers from limited generalization.
In the field of sentiment analysis, deep learning methods have achieved remarkable success in recent years. These approaches do not require manual feature engineering. Instead, models can learn task-relevant features from large-scale data and exhibit strong generalization capability. Ref. [14] applied BiLSTM [15] to a social signal sentiment classification task. Ref. [16] introduced a global RNN-based sentiment classification method for analyzing sentiment in Weibo comments and successfully demonstrated its effectiveness. Ref. [17] proposed a novel attention mechanism that combines LSTM and TCN networks to direct the model’s focus towards words associated with emotion. Ref. [18] proposed an Attention-based Bidirectional CNN-RNN Deep Model (ABCDM) and performed well on the sentiment analysis task. Ref. [19] incorporated an attention mechanism into multi-channel CNN and BiGRU for experimentation, resulting in improved classification outcomes. Ref. [20] conducted an extensive review of the deep learning methods frequently employed in sentiment analysis. With the introduction of pre-trained models such as BERT [21], many studies [9,22,23,24,25] have also tried to implement sentiment analysis based on pre-trained models. Unlike algorithmic research that focuses on domain-specific datasets, we aim to investigate and apply methodologies in more generalized data scenarios. Consequently, we have developed a multi-scenario sentiment classification dataset encompassing various real-world contexts, including e-commerce.

2.2. Self-Knowledge Distillation

Knowledge distillation [26] is a technique that transfers knowledge from a complex and large neural network (the teacher network) to a simpler and smaller one (the student network). This process improves the performance of the smaller network without incurring additional costs. In contrast to knowledge distillation, self-distillation does not rely on a teacher network and instead improves the network on its own. Numerous studies have applied self-distillation to various tasks. Ref. [27] proposed progressive self-knowledge distillation (PS-KD), which softens hard targets during training. Ref. [28] proposed an efficient self-distillation method named Zipf's Label Smoothing (Zipf's LS), which uses a network's on-the-fly predictions to generate soft supervision conforming to a Zipf distribution, without using any contrastive samples or auxiliary parameters. Ref. [29] utilized soft labels generated from the previous mini-batch to supervise the current batch, achieving model self-distillation. Ref. [30] proposed a soft-label self-distillation strategy for language modeling and machine translation. Ref. [31] utilized self-knowledge distillation for text summarization to address the issues of maximum-likelihood training on single-reference and noisy datasets. Most previous self-distillation methods require additional networks or complex designs, which hinders extending the algorithm to other tasks; moreover, most are applied to image tasks. In this paper, we propose a novel and simple self-distillation method that utilizes two exponential moving average (EMA) models to alleviate model confusion and enhance generalization capability.

2.3. Prototype Contrastive Learning

In the field of sentiment analysis, data often come from various textual sources such as social media posts, product reviews, and news reports. Each of these sources exhibits unique language styles and modes of emotional expression, resulting in data heterogeneity. This heterogeneity poses challenges to the accuracy and effectiveness of sentiment analysis.
Recently, researchers have proposed various methods to address data heterogeneity in sentiment analysis. For example, Ref. [32] proposed a sentiment analysis method based on a heterogeneous multi-relational signed network, highlighting the importance of constructing dense heterogeneous signed information networks. Their approach extracts latent relationships between similar node types and uses sentiment inversion and meta-path similarity for relationship prediction. Ref. [33] analyzed 23 million headlines from 47 news media outlets from 2000 to 2019, using the SiEBERT model fine-tuned on diverse datasets to predict sentiment polarity. Their results demonstrate that Transformer models can reliably measure sentiment from different text sources, enhancing the generalization and accuracy of sentiment analysis models. Despite the progress made, the issue of heterogeneity among mixed data remains unresolved.
To address the aforementioned issue, we have explored the potential of prototype contrastive learning [34]. Prototype contrastive learning is a representation learning method that introduces prototypes as centroids of clusters of similar samples, effectively representing the high-dimensional data of each category. During training, it assigns each sample to multiple prototypes at different granularities, encouraging image embeddings to be closer to their corresponding prototypes. This process enhances the model’s ability to distinguish between different categories and optimizes performance by minimizing the distance between similar samples and prototypes while maximizing the distance between different categories. Inspired by prototype contrastive learning, this paper proposes a supervised contrastive learning method for text sentiment classification. This method aims to mitigate sample heterogeneity within the same category by establishing multiple prototypes within each category, effectively addressing the heterogeneity among mixed data.

3. Public Opinion Dataset Construction

Our research focuses on analyzing Chinese public opinion. Most existing datasets are limited to specific scenarios, which do not align with our goals. As a result, we have curated a multi-scenario comment sentiment dataset from data used in previous studies. It is important to note that we did not gather new data but rather refined and combined existing datasets [35] (sources: https://raw.githubusercontent.com/SophonPlus/ChineseNlpCorpus/master/datasets/ChnSentiCorp_htl_all/ChnSentiCorp_htl_all.csv, https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/yf_dianping/intro.ipynb, https://github.com/SophonPlus/ChineseNlpCorpus/raw/master/datasets/online_shopping_10_cats/online_shopping_10_cats.zip, https://github.com/SophonPlus/ChineseNlpCorpus/raw/master/datasets/waimai_10k/waimai_10k.csv, https://www.kaggle.com/utmhikari/doubanmovieshortcomments, https://pan.baidu.com/s/1DoQbki3YwqkuwQUOj64R_g, accessed on 22 June 2024). In building this dataset, we prioritized a diverse range of scenarios, sourcing reviews from platforms such as e-commerce, group buying, movies, and social media. The number of samples per scenario is shown in Table 1. The dataset comprises only two labels, 0 and 1, where 1 denotes positive sentiment and 0 denotes negative sentiment; the two labels are distributed in a ratio of approximately 1:1. We divided the dataset into training, validation, and test sets in the ratio of 8:1:1, as sketched below.
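A minimal sketch of the 8:1:1 split described above, assuming the merged reviews live in a pandas DataFrame with "text" and "label" columns (the file name and column names are illustrative, not the dataset's actual layout):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical merged file; the real dataset is linked in the Data Availability Statement.
df = pd.read_csv("merged_reviews.csv")

# 80% train, then split the remaining 20% evenly into validation and test,
# stratifying on the label to preserve the ~1:1 class ratio.
train_df, rest_df = train_test_split(df, test_size=0.2, stratify=df["label"], random_state=42)
val_df, test_df = train_test_split(rest_df, test_size=0.5, stratify=rest_df["label"], random_state=42)
```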

4. Method

4.1. The Framework

Figure 2 illustrates the structure of our method, which consists of two main modules: the EMA self-distillation module and the prototypical supervised contrastive learning module. As shown in Figure 2, we use two EMA (exponential moving average) models to produce soft labels as extra guidance, reducing model confusion. The prototypical supervised contrastive learning module pulls together representations of the same class, thereby mitigating data heterogeneity.

4.2. Self-Distillation from the Dual EMA Model

4.2.1. Revisit of Knowledge Distillation and Self-Knowledge Distillation

Knowledge distillation [26] aims to transfer knowledge from a large and complex network (teacher network) to a smaller and simpler network (student network). This process enhances the performance of the smaller network without incurring extra expenses. There are two types of KD techniques: logit-based methods and intermediate representation methods. The former utilizes the teacher’s output probabilities as auxiliary signals to train a smaller model, called the student. On the other hand, the latter uses the teacher’s intermediate representations to train the student model. In this paper, we focus on the logit-based distillation method. The equation is as follows:
$L_{KD} = \alpha D_{cls}(p_s, y) + (1 - \alpha) D_{kd}(p_s, p_t)$
where $L_{KD}$ is the knowledge distillation loss; Kullback–Leibler divergence is typically used as the distillation term $D_{kd}$. $D_{cls}$ is the classification loss, for which cross-entropy is typically used. $\alpha$ is a hyperparameter that balances the two loss terms, and $p_s$ and $p_t$ are the student and teacher logits. In contrast to knowledge distillation, self-distillation does not require an additional teacher model; $p_t$ is generally obtained from the current model.
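The following is a minimal PyTorch sketch of this logit-based loss, assuming a temperature-free formulation for brevity (function and variable names are illustrative):

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, alpha=0.5):
    # D_cls: cross-entropy between student predictions and the hard labels y.
    cls = F.cross_entropy(student_logits, labels)
    # D_kd: KL divergence between student and teacher output distributions.
    kd = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    return alpha * cls + (1 - alpha) * kd
```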

4.2.2. Self-Distillation from the Dual EMA Model

In this paper, we use EMA models to generate soft labels as additional knowledge. As shown in Figure 2, we use the logits of two EMA models as additional supervised signals. The equations are as follows:
$\Theta_{ema_1,t+1} = \varepsilon_1 \Theta_{ema_1,t} + (1 - \varepsilon_1) \Theta_{s,t}$
$\Theta_{ema_2,t+1} = \varepsilon_2 \Theta_{ema_2,t} + (1 - \varepsilon_2) \Theta_{s,t}$
where $\Theta_{ema_1}$ and $\Theta_{ema_2}$ are the parameters of the two EMA models and $\Theta_s$ are the parameters of the backbone model; $\Theta_{s,t}$, $\Theta_{ema_1,t}$, and $\Theta_{ema_2,t}$ denote the respective parameters at step t. $\varepsilon_1$ and $\varepsilon_2$ are hyperparameters controlling how strongly the two EMA models are smoothed: the larger $\varepsilon$ is, the more slowly the EMA parameters are updated. We set $\varepsilon_1$ between 0 and 0.5 and $\varepsilon_2$ to a value close to 1, so that the model can learn soft-label knowledge at different levels.
We use the logits obtained by the two EMA models as additional knowledge to supervise the backbone model. We also use KL divergence as an additional loss function. The equations are as follows:
$p_s = F(\Theta_s; x)$
$p_{ema_1} = F(\Theta_{ema_1}; x)$
$p_{ema_2} = F(\Theta_{ema_2}; x)$
$L_{SelfKD} = \left(1 - \frac{t}{T}\right) D_{kl}(p_s, p_{ema_1}) + \frac{t}{T} D_{kl}(p_s, p_{ema_2})$
where F is the model, $\Theta$ its weights, and x the input. t denotes the current training epoch, T the total number of training epochs, and $D_{kl}$ the Kullback–Leibler divergence.
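A sketch of the dual-EMA update and the epoch-weighted self-distillation loss, under the assumption that the two EMA models are deep copies of the backbone (helper names are illustrative):

```python
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(ema_model, model, eps):
    # theta_ema <- eps * theta_ema + (1 - eps) * theta_student
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(eps).add_(p, alpha=1.0 - eps)

def self_kd_loss(p_s, p_ema1, p_ema2, t, T):
    # Early epochs weight the fast EMA (eps_1) more; later epochs shift toward
    # the slow EMA (eps_2), following the (1 - t/T) / (t/T) schedule above.
    kl1 = F.kl_div(F.log_softmax(p_s, dim=-1), F.softmax(p_ema1, dim=-1), reduction="batchmean")
    kl2 = F.kl_div(F.log_softmax(p_s, dim=-1), F.softmax(p_ema2, dim=-1), reduction="batchmean")
    return (1 - t / T) * kl1 + (t / T) * kl2

# Usage sketch: ema1 = copy.deepcopy(model); ema2 = copy.deepcopy(model),
# then after each optimizer step: ema_update(ema1, model, 0.2); ema_update(ema2, model, 0.9999)
```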

4.3. Prototypical Supervised Contrastive Learning

In text sentiment classification, handling data heterogeneity is a significant challenge. Here, heterogeneity refers to the diversity and complexity of the content under the same label, which traditional classification models struggle to manage effectively.
Inspired by prototype contrastive learning [34,36], this paper introduces a dual-branch architecture for prototype-supervised contrastive learning. This structure consists of a feature learning branch based on prototype-supervised contrastive learning and a classification branch that utilizes cross-entropy loss. This method enhances the model’s ability to handle intra-class diversity by constructing and maintaining a prototype for each category, where the prototype is defined as the central representation of the samples in that category. During training, the model not only optimizes the similarity between samples and their corresponding prototypes but also aims to enhance the distinction between samples and prototypes of other categories.
Specifically, we use a pretrained BERT model to extract deep semantic features from the text, and in the feature learning branch we initialize a prototype $p_c$ for each category. These prototypes are obtained by averaging the features of the samples under each category label:
$p_c = \frac{1}{N_c} \sum_{i=1}^{N_c} z_i$
where $N_c$ is the number of samples in category c and $z_i$ is the feature vector of the i-th sample in category c. Acting as centroids in the metric space, these prototypes encourage samples from the same category to cluster around them, mitigating the issue of data heterogeneity.
During training, the similarity between a sample's features and its corresponding category prototype is computed and contrasted with its similarity to the prototypes of other categories. Optimizing with $L_{Scl}$ minimizes the distance between each sample and its own category prototype while maximizing its distance from the prototypes of other categories:
$L_{Scl}(z_i) = -\log \frac{\exp(z_i \cdot p_{c_i} / \tau)}{\sum_{j=1, j \neq i}^{C} \exp(z_i \cdot p_{c_j} / \tau)}$
where $p_{c_i}$ is the prototype representation of class $c_i$, $\tau$ is the temperature scaling factor, and C is the total number of categories in the dataset. By pulling samples toward their own prototypes and pushing them away from the others, $L_{Scl}$ alleviates the data heterogeneity issue.
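A batch-level sketch of the prototype construction and the loss above, assuming L2-normalized features and that every class appears in the batch (helper names are illustrative):

```python
import torch
import torch.nn.functional as F

def class_prototypes(z, labels, num_classes):
    # p_c: mean feature of the samples belonging to each class; assumes each
    # class has at least one sample in the batch.
    protos = torch.stack([z[labels == c].mean(dim=0) for c in range(num_classes)])
    return F.normalize(protos, dim=-1)

def proto_scl_loss(z, labels, protos, tau=0.1):
    z = F.normalize(z, dim=-1)
    sims = z @ protos.T / tau                               # (batch, C) scaled similarities
    pos = sims.gather(1, labels.unsqueeze(1)).squeeze(1)    # z_i . p_{c_i} / tau
    # Denominator runs over the *other* class prototypes, as in the equation above.
    own_class = F.one_hot(labels, num_classes=protos.size(0)).bool()
    denom = torch.logsumexp(sims.masked_fill(own_class, float("-inf")), dim=1)
    return (denom - pos).mean()                             # -log(exp(pos) / sum over others)
```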
Finally, we use the feature parameters learned in the feature learning branch to update the encoder, whose parameters are then passed to the classifier branch. The classifier branch optimizes the discrepancy between the model's predicted probability distribution and the true label distribution using the cross-entropy loss $L_{CE}$:
$L_{CE} = -\sum_{i=1}^{N} c_i \log(v_i)$
where $c_i$ are the true labels and $v_i$ are the predicted probabilities. This process refines the feature learning branch for specific categories using the updated prototypes, which not only adjusts the weights of the classifier branch but also effectively reduces classification bias, thereby enhancing classification accuracy.

4.4. Training Objectives

In our framework, there are three loss functions: $L_{SelfKD}$, $L_{Scl}$, and $L_{CE}$. The final training loss $L_{Overall}$ is as follows:
$L_{Overall} = L_{CE} + \beta L_{Scl} + (1 - \beta) L_{SelfKD}$
where $\beta$ is a hyperparameter that balances the weights of $L_{Scl}$ and $L_{SelfKD}$.
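A sketch of one training step combining the three terms, reusing the illustrative helpers from the previous subsections and assuming the model returns both classifier logits and projected features:

```python
import torch
import torch.nn.functional as F

def training_step(model, ema1, ema2, batch, protos, t, T, beta=0.4):
    # Assumption: model(input_ids) -> (classification logits, projected features z).
    logits, z = model(batch["input_ids"])
    ce = F.cross_entropy(logits, batch["labels"])        # L_CE
    scl = proto_scl_loss(z, batch["labels"], protos)     # L_Scl (sketched in Section 4.3)
    with torch.no_grad():                                # EMA teachers are not trained
        p_ema1, _ = ema1(batch["input_ids"])
        p_ema2, _ = ema2(batch["input_ids"])
    skd = self_kd_loss(logits, p_ema1, p_ema2, t, T)     # L_SelfKD (Section 4.2.2)
    return ce + beta * scl + (1 - beta) * skd            # L_Overall
```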

5. Experiments

In this section, we validate the effectiveness of our method on the opinion analysis dataset and provide further analysis and discussion.

5.1. Evaluation Indexes

In this paper, we adopt the evaluation metrics commonly used in sentiment analysis: accuracy, precision, recall, and F1. The equations are as follows:
$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$
$Precision = \frac{TP}{TP + FP}$
$Recall = \frac{TP}{TP + FN}$
$F1 = \frac{2 \times P \times R}{P + R}$
As shown in Table 2, TP, FP, TN, and FN are the numbers of comment samples correctly classified as positive, incorrectly classified as positive, correctly classified as negative, and incorrectly classified as negative, respectively; P and R denote precision and recall.
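A small sketch computing the four metrics from binary predictions with plain tensor operations (scikit-learn's precision_recall_fscore_support would serve equally well):

```python
import torch

def binary_metrics(preds, labels):
    tp = ((preds == 1) & (labels == 1)).sum().item()
    tn = ((preds == 0) & (labels == 0)).sum().item()
    fp = ((preds == 1) & (labels == 0)).sum().item()
    fn = ((preds == 0) & (labels == 1)).sum().item()
    accuracy = (tp + tn) / max(tp + tn + fp + fn, 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```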

5.2. Experiments Setting

During the experiments, the BERT encoder is trained with AdamW [37] for 10 epochs. After a warm-up period covering the first five percent of training steps, the learning rate decays linearly from its initial value to 0. We set the mini-batch size to 64 and the initial learning rate to 0.0001. We set the hyperparameter $\varepsilon_1$ to 0.2, $\varepsilon_2$ to 0.9999, and $\beta$ to 0.4. The experiments are conducted on an NVIDIA RTX 3090 24 GB GPU. We implement our code using the PyTorch [38] and Huggingface Transformers [39] libraries.
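A sketch of this optimization setup, assuming `model` and `steps_per_epoch` are already defined (the scheduler utility is from Huggingface Transformers):

```python
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

total_steps = steps_per_epoch * 10  # 10 training epochs
optimizer = AdamW(model.parameters(), lr=1e-4)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.05 * total_steps),  # five percent warm-up
    num_training_steps=total_steps,            # then linear decay to 0
)
# Call scheduler.step() after each optimizer.step().
```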

5.3. Comparison with Other Methods

To demonstrate the effectiveness of our method, we compare it with other advanced sentiment analysis baseline methods:
  • Word2Vec-BiLSTM [40]: This method uses word2vec for word embeddings, followed by a BiLSTM for further feature extraction and classification.
  • Word2Vec-TextCNN [41]: This method uses word2vec for word embeddings, followed by a TextCNN for further feature extraction and classification.
  • BERT [21]: BERT is a pre-trained language model based on the Transformer [42], which captures deep bi-directional representations by pre-training on large-scale textual data. Then, fine-tuning training is used on the opinion dataset.
  • BERT-BiLSTM: This method uses BERT for word embeddings, followed by a BiLSTM for further sequence feature aggregation and classification.
  • BERT-RCNN: This method uses BERT for word embeddings followed by a Recurrent Convolutional Neural Network (RCNN) [43] for further sequence feature aggregation and classification.
  • BERT-TextCNN: This method uses BERT for word embeddings followed by a TextCNN for further sequence feature aggregation and classification.
  • BERT-DPCNN: This method uses BERT for word embeddings followed by a Deep Pyramid Convolutional Neural Network (DPCNN) [44] for further sequence feature aggregation and classification.
To validate the effectiveness of our method, we compared it with other current state-of-the-art methods. The experimental findings are presented in Table 3. Our method outperforms all the existing methods in precision, F1 score, and accuracy, which underscores the competitiveness of our method. When compared to the previous state-of-the-art method, our method recorded a 1.07% improvement in F1 score. Notably, when compared to the robust baseline model BERT, our method achieved a remarkable 1.21% increase in F1 score, which is highly impressive.

5.4. Ablation Study

In this section, we will provide a detailed analysis of the various modules in our proposed method. To conduct thorough ablation experimental studies, we have chosen BERT as our baseline model. As shown in Table 4, the empirical findings presented in the ablation study support our claims and reveal three observations.
  • Each of our modules shows some performance improvement, which proves the effectiveness of each module of our approach.
  • Although there has been some improvement, the prototypical supervised contrastive learning module’s performance only marginally improved (0.10%, 0.23%). This suggests limited progress from a representation standpoint.
  • The self-distillation module has a more significant performance boost (0.70%, 0.17%) than the contrastive learning module. This demonstrates the significance of using soft labeling for model learning and emphasizes the motivation behind the research in this paper. In addition to this, we observe that two EMA models have a 0.17% performance improvement over a single EMA model, which indicates that two EMA models can provide different soft label knowledge for model training.

5.5. Further Analysis

5.5.1. Case Study

In this section, we present additional evidence of the effectiveness of our method by analyzing specific examples. To visually demonstrate its performance, we conducted an attribution study using Captum [45], which identifies the words that contributed to the prediction. As shown in Figure 3, two examples are predicted incorrectly by the baseline model (BERT) but correctly by our method. We observe that in both examples, different sentiment tendencies appear within the same sentence. Compared to BERT, our method mitigates this type of problem by focusing more on the relevant words and minimizing the influence of other, unimportant emotional cues on the prediction.

5.5.2. Error Analysis

To further explore the problems with the current approach, we tried to find multiple examples of prediction errors. As shown in Figure 4, we observe four cases of prediction error:
  • Negative adverbs + positive adjectives. Identifying double negation cases can be challenging with the current method, because the cognitive power of the baseline method is limited by its parameter count. We will try to solve this problem by using models with more parameters, such as Llama [46].
  • Double emotional expression. In some cases, a sentence may contain multiple emotions. To address this problem, we have used a soft labeling method. However, there may still be some errors made by the model which require further exploration to resolve.
  • Some words in our dataset are not in the encoder’s word list, which can cause confusion in model training. Subsequent research will further update the word lists to address such issues.
  • Some of the data are mislabeled. As illustrated in Figure 4 (4), the instance is intended to convey a negative emotion but is labeled as positive. This inconsistency may have arisen due to differences between users’ words and actions during the evaluation process. Fortunately, our model is still able to function effectively despite the labeling error.

5.5.3. Visualization

In this section, we demonstrate the effectiveness of our method through visualization. Figure 5 displays the t-SNE [47] plots of the representations on the testing set. It is evident that our method yields a more compact grouping of examples with the same label, while the distribution of the embeddings learned with BERT is less compact. This indicates that, through contrastive learning, our method produces a more consistent and uniform representation.

6. Conclusions

With the vigorous development of social media and the Internet, user-generated text comments have grown explosively, providing rich resources for public opinion analysis. Sentiment analysis occupies a central position in this work because it helps enterprises and governments gain insight into the public's emotional tendencies and formulate corresponding strategies. This article proposes an innovative model that accurately extracts emotional information from text, greatly facilitating public opinion analysis. The model can help governments gauge and give early warning of the intensity of online public opinion during emergencies, evaluate product quality by analyzing user comments on e-retail websites, and even weigh ratings on movie websites. Notably, it can effectively eliminate the interference of meaningless comments on the overall prediction results, providing more accurate and efficient services for government departments and the e-commerce field. To promote in-depth study of opinion analysis, this article also collects and organizes a new real-world multi-scenario user comment dataset. The proposed self-distillation contrastive learning method effectively addresses the heterogeneity and complexity challenges posed by multi-scenario data. In future work, we will apply this method to comments on real topics to further advance research on opinion analysis.

Author Contributions

Conceptualization and data curation: D.Z. and W.H.; methodology: L.S. and B.W.; visualization: L.S. and B.W.; writing—review and editing: D.Z., L.S. and B.W.; supervision: H.X. and W.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the project of National Natural Science Foundation of China (NSFC) “Research on the information participation behavior calibration, trajectory aggregation and targeted guidance for Internet public sentiment audiences in major emergencies” (Grant No. 72174072).

Data Availability Statement

The collected dataset is available at https://github.com/shilida/Online_Public_Opinion_Dataset (accessed on 26 June 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Turney, P.D. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. arXiv 2002, arXiv:cs/0212032. [Google Scholar]
  2. Nasukawa, T.; Yi, J. Sentiment analysis: Capturing favorability using natural language processing. In Proceedings of the 2nd International Conference on Knowledge Capture, Sanibel Island, FL, USA, 23–25 October 2003; pp. 70–77. [Google Scholar]
  3. Taboada, M.; Brooke, J.; Tofiloski, M.; Voll, K.; Stede, M. Lexicon-based methods for sentiment analysis. Comput. Linguist. 2011, 37, 267–307. [Google Scholar] [CrossRef]
  4. Feldman, R. Techniques and applications for sentiment analysis. Commun. ACM 2013, 56, 82–89. [Google Scholar] [CrossRef]
  5. Pang, B.; Lee, L.; Vaithyanathan, S. Thumbs up? Sentiment classification using machine learning techniques. arXiv 2002, arXiv:cs/0205070. [Google Scholar]
  6. Barbosa, L.; Feng, J. Robust sentiment detection on twitter from biased and noisy data. In Proceedings of the Coling 2010: Posters, Beijing, China, 23–27 August 2010; pp. 36–44. [Google Scholar]
  7. Zhao, W.; Guan, Z.; Chen, L.; He, X.; Cai, D.; Wang, B.; Wang, Q. Weakly-supervised deep embedding for product review sentiment analysis. IEEE Trans. Knowl. Data Eng. 2017, 30, 185–197. [Google Scholar] [CrossRef]
  8. Vateekul, P.; Koomsubha, T. A study of sentiment analysis using deep learning techniques on Thai Twitter data. In Proceedings of the 2016 13th International Joint Conference on Computer Science and Software Engineering (JCSSE), Khon Kaen, Thailand, 13–15 July 2016; pp. 1–6. [Google Scholar]
  9. Gao, Z.; Feng, A.; Song, X.; Wu, X. Target-dependent sentiment classification with BERT. IEEE Access 2019, 7, 154290–154299. [Google Scholar] [CrossRef]
  10. Singh, M.; Jakhar, A.K.; Pandey, S. Sentiment analysis on the impact of coronavirus in social life using the BERT model. Soc. Netw. Anal. Min. 2021, 11, 33. [Google Scholar] [CrossRef] [PubMed]
  11. Pang, B.; Lee, L. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. arXiv 2004, arXiv:cs/0409058. [Google Scholar]
  12. Turney, P.D.; Littman, M.L. Measuring praise and criticism: Inference of semantic orientation from association. ACM Trans. Inf. Syst. (Tois) 2003, 21, 315–346. [Google Scholar] [CrossRef]
  13. Kang, H.; Yoo, S.J.; Han, D. Senti-lexicon and improved Naïve Bayes algorithms for sentiment analysis of restaurant reviews. Expert Syst. Appl. 2012, 39, 6000–6010. [Google Scholar] [CrossRef]
  14. Brueckner, R.; Schulter, B. Social signal classification using deep blstm recurrent neural networks. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014. [Google Scholar] [CrossRef]
  15. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  16. Cheng, J.; Li, P.; Ding, Z.; Zhang, S.; Wang, H. Sentiment classification of Chinese microblogging texts with global RNN. In Proceedings of the 2016 IEEE First International Conference on Data Science in Cyberspace (DSC), Changsha, China, 13–16 June 2016; pp. 653–657. [Google Scholar]
  17. Cao, D.; Huang, Y.; Li, H.; Zhao, X.; Zhao, Q.; Fu, Y. Text Sentiment Classification Based on LSTM-TCN Hybrid Model and Attention Mechanism. In Proceedings of the 4th International Conference on Computer Science and Application Engineering, Sanya, China, 20–22 October 2020. [Google Scholar] [CrossRef]
  18. Basiri, M.E.; Nemati, S.; Abdar, M.; Cambria, E.; Acharya, U.R. ABCDM: An attention-based bidirectional CNN-RNN deep model for sentiment analysis. Future Gener. Comput. Syst. 2021, 115, 279–294. [Google Scholar] [CrossRef]
  19. Cheng, Y.; Yao, L.; Xiang, G.; Zhang, G.; Tang, T.; Zhong, L. Text Sentiment Orientation Analysis Based on Multi-Channel CNN and Bidirectional GRU with Attention Mechanism. IEEE Access 2020, 8, 134964–134975. [Google Scholar] [CrossRef]
  20. Wadawadagi, R.; Pagi, V. Sentiment analysis with deep neural networks: Comparative study and performance assessment. Artif. Intell. Rev. 2020, 53, 6155–6195. [Google Scholar] [CrossRef]
  21. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  22. Hoang, M.; Bihorac, O.A.; Rouces, J. Aspect-based sentiment analysis using bert. In Proceedings of the 22nd Nordic Conference on Computational Linguistics, Turku, Finland, 30 September–2 October 2019; pp. 187–196. [Google Scholar]
  23. Li, X.; Bing, L.; Zhang, W.; Lam, W. Exploiting BERT for end-to-end aspect-based sentiment analysis. arXiv 2019, arXiv:1910.00883. [Google Scholar]
  24. Yan, C.; Liu, J.; Liu, W.; Liu, X. Research on public opinion sentiment classification based on attention parallel dual-channel deep learning hybrid model. Eng. Appl. Artif. Intell. 2022, 116, 105448. [Google Scholar] [CrossRef]
  25. Qin, Y.; Shi, Y.; Hao, X.; Liu, J. Microblog Text Emotion Classification Algorithm Based on TCN-BiGRU and Dual Attention. Information 2023, 14, 90. [Google Scholar] [CrossRef]
  26. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  27. Kim, K.; Ji, B.; Yoon, D.; Hwang, S. Self-knowledge distillation with progressive refinement of targets. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 6567–6576. [Google Scholar]
  28. Liang, J.; Li, L.; Bing, Z.; Zhao, B.; Tang, Y.; Lin, B.; Fan, H. Efficient one pass self-distillation with zipf’s label smoothing. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 104–119. [Google Scholar]
  29. Shen, Y.; Xu, L.; Yang, Y.; Li, Y.; Guo, Y. Self-distillation from the last mini-batch for consistency regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11943–11952. [Google Scholar]
  30. Hahn, S.; Choi, H. Self-knowledge distillation in natural language processing. arXiv 2019, arXiv:1908.01851. [Google Scholar]
  31. Liu, Y.; Shen, S.; Lapata, M. Noisy self-knowledge distillation for text summarization. arXiv 2020, arXiv:2009.07032. [Google Scholar]
  32. Zhao, Q.; Yu, C.; Huang, J.; Lian, J.; An, D. Sentiment analysis based on heterogeneous multi-relation signed network. Mathematics 2024, 12, 331. [Google Scholar] [CrossRef]
  33. Rozado, D.; Hughes, R.; Halberstadt, J. Longitudinal analysis of sentiment and emotion in news media headlines using automated labelling with Transformer language models. PLoS ONE 2022, 17, e0276367. [Google Scholar] [CrossRef]
  34. Li, J.; Zhou, P.; Xiong, C.; Hoi, S.C. Prototypical contrastive learning of unsupervised representations. arXiv 2020, arXiv:2005.04966. [Google Scholar]
  35. Zhang, Y.; Lai, G.; Zhang, M.; Zhang, Y.; Liu, Y.; Ma, S. Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, Queensland, Australia, 11 July 2014; pp. 83–92. [Google Scholar]
  36. Wang, P.; Han, K.; Wei, X.S.; Zhang, L.; Wang, L. Contrastive learning based hybrid networks for long-tailed image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 943–952. [Google Scholar]
  37. Loshchilov, I.; Hutter, F. Fixing Weight Decay Regularization in Adam. 2018. Available online: https://openreview.net/forum?id=rk6qdGgCZ (accessed on 22 June 2024).
  38. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32, 8026–8037. [Google Scholar]
  39. Wolf, T.; Chaumond, J.; Debut, L.; Sanh, V.; Delangue, C.; Moi, A.; Cistac, P.; Funtowicz, M.; Davison, J.; Shleifer, S.; et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; pp. 38–45. [Google Scholar]
  40. Lestari, V.B.; Utami, E. Combining Bi-LSTM and Word2vec Embedding for Sentiment Analysis Models of Application User Reviews. Indones. J. Comput. Sci. 2024, 13. [Google Scholar] [CrossRef]
  41. Zhang, Y.; Wallace, B. A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification. arXiv 2015, arXiv:1510.03820. [Google Scholar]
  42. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  43. Lai, S.; Xu, L.; Liu, K.; Zhao, J. Recurrent convolutional neural networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015; Volume 29. [Google Scholar]
  44. Johnson, R.; Zhang, T. Deep pyramid convolutional neural networks for text categorization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 562–570. [Google Scholar]
  45. Kokhlikyan, N.; Miglani, V.; Martin, M.; Wang, E.; Alsallakh, B.; Reynolds, J.; Melnikov, A.; Kliushkina, N.; Araya, C.; Yan, S.; et al. Captum: A unified and generic model interpretability library for pytorch. arXiv 2020, arXiv:2009.07896. [Google Scholar]
  46. Hoffmann, J.; Borgeaud, S.; Mensch, A.; Buchatskaya, E.; Cai, T.; Rutherford, E.; Casas, D.d.L.; Hendricks, L.A.; Welbl, J.; Clark, A.; et al. Training compute-optimal large language models. arXiv 2022, arXiv:2203.15556. [Google Scholar]
  47. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
Figure 1. Some instances in the dataset. Green represents positive sentiment, while red represents negative sentiment.
Figure 2. The framework of our method. Different colors of the model represent different weighting parameters. In this paper, the encoder is BERT and $\Theta$ is its weight. $Emb$ is the feature obtained by the encoder and $f_e$ is the projection layer. $p_s$, $p_{ema_1}$, and $p_{ema_2}$ represent the probability distributions obtained from the main and EMA models, respectively. $L_{CE}$, $L_{Scl}$, and $L_{SelfKD}$ represent the cross-entropy, contrastive, and self-distillation losses, respectively. The calculation procedure is presented in Section 4.
Figure 3. Instances of the case study. True Label and Predict Label denote the ground-truth label and the label predicted by the model, respectively. Green text represents a positive contribution to the classification, red a negative contribution, and uncolored text no contribution. The same applies below.
Figure 4. Instances of prediction errors.
Figure 5. t-SNE plots of the embeddings on the test dataset. Left: BERT; right: ours. Violet: negative examples; yellow: positive examples.
Table 1. Number of samples per scenario.

E-Commerce   Group Buying   Social Media   Movies   Overall
108,140      69,310         14,289        50,000   241,739
Table 2. The confusion matrix.

                          Predicted Positive   Predicted Negative
Ground truth: Positive    TP                   FN
Ground truth: Negative    FP                   TN
Table 3. Experimental results on our public opinion dataset. Each result is the mean and standard deviation over 6 different random seeds. The best scores are in bold.

Method             Precision      Recall         F1             Accuracy
Word2Vec-BiLSTM    81.45 ± 1.38   80.73 ± 1.82   79.94 ± 2.43   80.79 ± 3.12
Word2Vec-TextCNN   79.13 ± 2.49   79.48 ± 2.32   79.64 ± 1.95   79.04 ± 3.06
BERT               85.15 ± 1.07   87.35 ± 1.14   86.23 ± 0.42   85.15 ± 1.39
BERT-BiLSTM        83.43 ± 1.73   88.28 ± 1.35   86.37 ± 0.39   85.38 ± 0.87
BERT-TextCNN       86.74 ± 0.76   85.62 ± 3.65   86.04 ± 0.68   86.14 ± 0.76
BERT-RCNN          83.89 ± 1.24   87.93 ± 0.57   86.17 ± 0.14   85.96 ± 0.21
BERT-DPCNN         86.07 ± 1.04   86.01 ± 2.34   86.06 ± 0.23   85.89 ± 0.84
Ours               86.99 ± 0.79   87.63 ± 1.02   87.44 ± 0.27   87.14 ± 0.32
Table 4. Ablation experiment results. "Single" and "Dual" represent the numbers of branches (see Section 4.2.2 and Section 4.3 for details). The best scores are in bold.

Method          F1      Δ
BERT            86.11   -
Single-branch   86.21   +0.10
Dual-branch     86.34   +0.23
Single-EMA      87.04   +0.70
Dual-EMA        87.21   +0.17

Citation: Zhou, D.; Shi, L.; Wang, B.; Xu, H.; Huang, W. Are Your Comments Positive? A Self-Distillation Contrastive Learning Method for Analyzing Online Public Opinion. Electronics 2024, 13, 2509. https://doi.org/10.3390/electronics13132509
