In the sentiment classifier evaluation phase, we report the results of our experiments on Thai financial news sentiment classification using the machine learning techniques naive Bayes, random forest, and SVM and the deep learning techniques CNN and LSTM. Each classifier was evaluated with four feature representations (TF-IDF, bag-of-words, Word2Vec, and BERT) in unigram and bigram settings, under both 5-fold and 10-fold cross-validation. Classifier performance is reported as overall accuracy together with the f1-score, precision, and recall of each sentiment class (positive, neutral, and negative).
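Throughout this section, the per-class scores follow the standard definitions: for a sentiment class $c$ with true positives $TP_c$, false positives $FP_c$, and false negatives $FN_c$,
$$\mathrm{Precision}_c = \frac{TP_c}{TP_c + FP_c}, \qquad \mathrm{Recall}_c = \frac{TP_c}{TP_c + FN_c}, \qquad \mathrm{F1}_c = \frac{2\,\mathrm{Precision}_c \cdot \mathrm{Recall}_c}{\mathrm{Precision}_c + \mathrm{Recall}_c}.$$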
4.2.1 First Experiment.
In the first experiment, we used 5-fold cross-validation for each classification model; the results are as follows.
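As a minimal sketch of this evaluation setup (not the authors' implementation), the following assumes scikit-learn and a hypothetical load_thai_news() helper returning word-segmented Thai headlines and their sentiment labels; it wires unigram TF-IDF features into a naive Bayes classifier and collects accuracy and per-class precision, recall, and f1-scores over five stratified folds:

# Minimal 5-fold evaluation sketch (assumed setup, not the authors' code).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical loader: space-joined, word-segmented Thai headlines and labels.
texts, labels = load_thai_news()

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 1)),  # unigram TF-IDF representation
    MultinomialNB(),                      # naive Bayes classifier
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
predicted = cross_val_predict(model, texts, labels, cv=cv)

# Accuracy plus per-class precision, recall, and f1-score, as in Tables 7 to 11.
print(classification_report(labels, predicted, digits=2))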
For the naive Bayes classifier, the first unigram experiment combined with TF-IDF provides accuracy at 63.37%, whereas positive sentiment gives an f1-score of 0.45, a precision score of 0.36, and a recall score of 0.58. Neutral sentiment gives an f1-score of 0.65, a precision score of 0.66, and a recall score of 0.64. Negative sentiment gives an f1-score of 0.64, a precision score of 0.69, and a recall score of 0.60. The second experiment using bag-of-words provides accuracy at 62.14%, whereas positive sentiment gives an f1-score of 0.67, a precision score of 0.62, and a recall score of 0.73. Neutral sentiment gives an f1-score of 0.53, a precision score of 0.55, and a recall score of 0.51. Negative sentiment gives an f1-score of 0.68, a precision score of 0.71, and a recall score of 0.61. The third experiment using Word2Vec provides accuracy at 66.99%, whereas positive sentiment gives an f1-score of 0.74, a precision score of 0.70, and a recall score of 0.80. Neutral sentiment gives an f1-score of 0.65, a precision score of 0.62, and a recall score of 0.69. Negative sentiment gives an f1-score of 0.56, a precision score of 0.60, and a recall score of 0.52. Finally, the BERT model provides accuracy at 69.72%, whereas positive sentiment gives an f1-score of 0.73, a precision score of 0.70, and a recall score of 0.75. Neutral sentiment gives an f1-score of 0.64, a precision score of 0.61, and a recall score of 0.68. Negative sentiment gives an f1-score of 0.79, a precision score of 0.78, and a recall score of 0.81.
The first bigram experiment combined with TF-IDF provides accuracy at 68.29%, whereas positive sentiment gives an f1-score of 0.62, a precision score of 0.59, and a recall score of 0.66. Neutral sentiment gives an f1-score of 0.62, a precision score of 0.60, and a recall score of 0.65. Negative sentiment gives an f1-score of 0.62, a precision score of 0.59, and a recall score of 0.65. The second experiment using bag-of-words provides accuracy at 66.32%, whereas positive sentiment gives an f1-score of 0.56, a precision score of 0.59, and a recall score of 0.53. Neutral sentiment gives an f1-score of 0.63, a precision score of 0.65, and a recall score of 0.62. Negative sentiment gives an f1-score of 0.60, a precision score of 0.66, and a recall score of 0.55. The third experiment using Word2Vec provides accuracy at 67.06%, whereas positive sentiment gives an f1-score of 0.65, a precision score of 0.65, and a recall score of 0.64. Neutral sentiment gives an f1-score of 0.63, a precision score of 0.57, and a recall score of 0.71. Negative sentiment gives an f1-score of 0.73, a precision score of 0.69, and a recall score of 0.78. Finally, the BERT model provides accuracy at 70.11%, whereas positive sentiment gives an f1-score of 0.77, a precision score of 0.82, and a recall score of 0.73. Neutral sentiment gives an f1-score of 0.67, a precision score of 0.69, and a recall score of 0.65. Negative sentiment gives an f1-score of 0.74, a precision score of 0.70, and a recall score of 0.77. The results are shown in Table 7.
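The unigram and bigram settings referred to above differ only in the n-gram range handed to the vectorizers. A brief sketch, assuming scikit-learn's vectorizers (the exact parameters used are not stated in the paper):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

unigram_tfidf = TfidfVectorizer(ngram_range=(1, 1))  # single word-segmented tokens
bigram_tfidf = TfidfVectorizer(ngram_range=(2, 2))   # consecutive token pairs
unigram_bow = CountVectorizer(ngram_range=(1, 1))    # bag-of-words term counts
bigram_bow = CountVectorizer(ngram_range=(2, 2))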
For the random forest classifier, the first unigram experiment combined with TF-IDF provides accuracy at 55.55%, whereas positive sentiment gives an f1-score of 0.57, a precision score of 0.66, and a recall score of 0.46. Neutral sentiment gives an f1-score of 0.52, a precision score of 0.73, and a recall score of 0.40. Negative sentiment gives an f1-score of 0.44, a precision score of 0.50, and a recall score of 0.39. The second experiment using bag-of-words provides accuracy at 54.10%, whereas positive sentiment gives an f1-score of 0.58, a precision score of 0.49, and a recall score of 0.73. Neutral sentiment gives an f1-score of 0.47, a precision score of 0.60, and a recall score of 0.38. Negative sentiment gives an f1-score of 0.52, a precision score of 0.47, and a recall score of 0.58. The third experiment using Word2Vec provides accuracy at 60.27%, whereas positive sentiment gives an f1-score of 0.58, a precision score of 0.57, and a recall score of 0.59. Neutral sentiment gives an f1-score of 0.59, a precision score of 0.52, and a recall score of 0.69. Negative sentiment gives an f1-score of 0.56, a precision score of 0.52, and a recall score of 0.60. Finally, the BERT model provides accuracy at 60.69%, whereas positive sentiment gives an f1-score of 0.57, a precision score of 0.50, and a recall score of 0.67. Neutral sentiment gives an f1-score of 0.65, a precision score of 0.70, and a recall score of 0.61. Negative sentiment gives an f1-score of 0.51, a precision score of 0.49, and a recall score of 0.54.
The first bigram experiment combined with TF-IDF provides accuracy at 57.56%, whereas positive sentiment gives an f1-score of 0.56, a precision score of 0.58, and a recall score of 0.54. Neutral sentiment gives an f1-score of 0.57, a precision score of 0.53, and a recall score of 0.60. Negative sentiment gives an f1-score of 0.55, a precision score of 0.61, and a recall score of 0.51. The second experiment using bag-of-words provides accuracy at 56.07%, whereas positive sentiment gives an f1-score of 0.55, a precision score of 0.63, and a recall score of 0.49. Neutral sentiment gives an f1-score of 0.61, a precision score of 0.68, and a recall score of 0.55. Negative sentiment gives an f1-score of 0.56, a precision score of 0.57, and a recall score of 0.55. The third experiment using Word2Vec provides accuracy at 60.98%, whereas positive sentiment gives an f1-score of 0.59, a precision score of 0.61, and a recall score of 0.56. Neutral sentiment gives an f1-score of 0.61, a precision score of 0.67, and a recall score of 0.56. Negative sentiment gives an f1-score of 0.57, a precision score of 0.70, and a recall score of 0.54. Finally, the BERT model provides accuracy at 61.18%, whereas positive sentiment gives an f1-score of 0.63, a precision score of 0.65, and a recall score of 0.60. Neutral sentiment gives an f1-score of 0.57, a precision score of 0.55, and a recall score of 0.59. Negative sentiment gives an f1-score of 0.61, a precision score of 0.72, and a recall score of 0.53. The results are shown in Table 8.
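Each of the classifiers above is also paired with BERT-derived features. As a sketch of one common way to obtain a fixed-size BERT representation for such classifiers, the following uses the Hugging Face transformers library with a multilingual BERT checkpoint as a placeholder; the paper does not state which pretrained model or pooling strategy was actually used:

import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder checkpoint; the pretrained BERT used in the paper is not named here.
CHECKPOINT = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
encoder = AutoModel.from_pretrained(CHECKPOINT)

def bert_features(headline):
    # Mean-pool the last-layer hidden states into one fixed-size document vector.
    inputs = tokenizer(headline, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # shape: (1, tokens, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()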
For the SVM classifier, the first unigram experiment combined with TF-IDF provides accuracy at 69.85%, whereas positive sentiment gives an f1-score of 0.74, a precision score of 0.70, and a recall score of 0.78. Neutral sentiment gives an f1-score of 0.66, a precision score of 0.64, and a recall score of 0.68. Negative sentiment gives an f1-score of 0.67, a precision score of 0.76, and a recall score of 0.60. The second experiment using bag-of-words provides accuracy at 67.75%, whereas positive sentiment gives an f1-score of 0.65, a precision score of 0.64, and a recall score of 0.66. Neutral sentiment gives an f1-score of 0.64, a precision score of 0.63, and a recall score of 0.65. Negative sentiment gives an f1-score of 0.67, a precision score of 0.64, and a recall score of 0.70. The third experiment using Word2Vec provides accuracy at 74.25%, whereas positive sentiment gives an f1-score of 0.71, a precision score of 0.77, and a recall score of 0.66. Neutral sentiment gives an f1-score of 0.64, a precision score of 0.63, and a recall score of 0.65. Negative sentiment gives an f1-score of 0.66, a precision score of 0.67, and a recall score of 0.63. Finally, the BERT model provides accuracy at 78.76%, whereas positive sentiment gives an f1-score of 0.75, a precision score of 0.74, and a recall score of 0.77. Neutral sentiment gives an f1-score of 0.68, a precision score of 0.75, and a recall score of 0.63. Negative sentiment gives an f1-score of 0.68, a precision score of 0.68, and a recall score of 0.68.
The first bigram experiment combined with TF-IDF provides accuracy at 70.58%, whereas positive sentiment gives an f1-score of 0.78, a precision score of 0.66, and a recall score of 0.94. Neutral sentiment gives an f1-score of 0.52, a precision score of 0.72, and a recall score of 0.41. Negative sentiment gives an f1-score of 0.65, a precision score of 1.00, and a recall score of 0.48. The second experiment using bag-of-words provides accuracy at 70.03%, whereas positive sentiment gives an f1-score of 0.74, a precision score of 0.70, and a recall score of 0.80. Neutral sentiment gives an f1-score of 0.69, a precision score of 0.69, and a recall score of 0.70. Negative sentiment gives an f1-score of 0.69, a precision score of 0.69, and a recall score of 0.69. The third experiment using Word2Vec provides accuracy at 77.98%, whereas positive sentiment gives an f1-score of 0.74, a precision score of 0.75, and a recall score of 0.72. Neutral sentiment gives an f1-score of 0.71, a precision score of 0.65, and a recall score of 0.78. Negative sentiment gives an f1-score of 0.67, a precision score of 0.64, and a recall score of 0.70. Finally, the BERT model provides accuracy at 78.91%, whereas positive sentiment gives an f1-score of 0.79, a precision score of 0.89, and a recall score of 0.71. Neutral sentiment gives an f1-score of 0.84, a precision score of 0.84, and a recall score of 0.84. Negative sentiment gives an f1-score of 0.80, a precision score of 0.79, and a recall score of 0.81. The results are shown in Table 9.
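As with the other classical models, the SVM consumes one fixed-length vector per headline. A minimal sketch of the Word2Vec variant, assuming gensim and scikit-learn and averaging word vectors into a document vector (the embedding size, aggregation, and kernel actually used are not stated in the paper):

import numpy as np
from gensim.models import Word2Vec
from sklearn.svm import SVC

# Toy word-segmented Thai headlines and labels, for illustration only.
tokenized_docs = [["หุ้น", "ขึ้น", "แรง"], ["กำไร", "ลด", "ลง"]]
labels = ["positive", "negative"]

w2v = Word2Vec(sentences=tokenized_docs, vector_size=100, min_count=1)

def doc_vector(tokens):
    # Average the Word2Vec vectors of the in-vocabulary tokens.
    vectors = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(w2v.vector_size)

X = np.vstack([doc_vector(doc) for doc in tokenized_docs])
svm = SVC(kernel="rbf").fit(X, labels)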
For the CNN classifier, the first unigram experiment combined with TF-IDF provides accuracy at 71.90%, whereas positive sentiment gives an f1-score of 0.75, a precision score of 0.77, and a recall score of 0.73. Neutral sentiment gives an f1-score of 0.59, a precision score of 0.62, and a recall score of 0.56. Negative sentiment gives an f1-score of 0.73, a precision score of 0.78, and a recall score of 0.69. The second experiment using bag-of-words provides an accuracy of 69.80%, whereas positive sentiment gives an f1-score of 0.72, a precision score of 0.72, and a recall score of 0.73. Neutral sentiment gives an f1-score of 0.51, a precision score of 0.44, and a recall score of 0.59. Negative sentiment gives an f1-score of 0.60, a precision score of 0.74, and a recall score of 0.51. The third experiment using Word2Vec provides accuracy at 77.66%, whereas positive sentiment gives an f1-score of 0.70, a precision score of 0.67, and a recall score of 0.73. Neutral sentiment gives an f1-score of 0.62, a precision score of 0.60, and a recall score of 0.65. Negative sentiment gives an f1-score of 0.71, a precision score of 0.78, and a recall score of 0.65. Finally, the BERT model provides accuracy at 78.27%, whereas positive sentiment gives an f1-score of 0.74, a precision score of 0.72, and a recall score of 0.77. Neutral sentiment gives an f1-score of 0.66, a precision score of 0.64, and a recall score of 0.68. Negative sentiment gives an f1-score of 0.75, a precision score of 0.80, and a recall score of 0.70.
The first bigram experiment combined with TF-IDF provides accuracy at 72.72%, whereas positive sentiment gives an f1-score of 0.74, a precision score of 0.72, and a recall score of 0.77. Neutral sentiment gives an f1-score of 0.63, a precision score of 0.64, and a recall score of 0.62. Negative sentiment gives an f1-score of 0.59, a precision score of 0.60, and a recall score of 0.57. The second experiment using bag-of-words provides an accuracy of 71.47%, whereas positive sentiment gives an f1-score of 0.75, a precision score of 0.75, and a recall score of 0.76. Neutral sentiment gives an f1-score of 0.53, a precision score of 0.55, and a recall score of 0.51. Negative sentiment gives an f1-score of 0.63, a precision score of 0.82, and a recall score of 0.52. The third experiment using Word2Vec provides accuracy at 79.89%, whereas positive sentiment gives an f1-score of 0.70, a precision score of 0.77, and a recall score of 0.64. Neutral sentiment gives an f1-score of 0.66, a precision score of 0.64, and a recall score of 0.67. Negative sentiment gives an f1-score of 0.65, a precision score of 0.70, and a recall score of 0.60. Finally, the BERT model provides accuracy at 80.64%, whereas positive sentiment gives an f1-score of 0.75, a precision score of 0.70, and a recall score of 0.81. Neutral sentiment gives an f1-score of 0.66, a precision score of 0.68, and a recall score of 0.63. Negative sentiment gives an f1-score of 0.64, a precision score of 0.62, and a recall score of 0.66. The results are shown in Table 10.
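For readers unfamiliar with the deep learning baselines, the following is an illustrative Keras sketch of a one-dimensional convolutional text classifier over the three sentiment classes. It is not the authors' architecture: the filter sizes and other hyperparameters are not restated here, and the embedding layer could, for instance, be initialized from the Word2Vec vectors described above:

import tensorflow as tf

VOCAB_SIZE = 20000  # illustrative vocabulary size
EMBED_DIM = 100     # illustrative embedding width
NUM_CLASSES = 3     # positive, neutral, negative

cnn = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),               # token embeddings
    tf.keras.layers.Conv1D(128, kernel_size=3, activation="relu"),  # n-gram-like filters
    tf.keras.layers.GlobalMaxPooling1D(),                           # strongest response per filter
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),       # class probabilities
])
cnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])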
For the LSTM classifier, the first unigram experiment combined with TF-IDF provides accuracy at 73.40%, whereas positive sentiment gives an f1-score of 0.76, a precision score of 0.75, and a recall score of 0.76. Neutral sentiment gives an f1-score of 0.72, a precision score of 0.80, and a recall score of 0.66. Negative sentiment gives an f1-score of 0.73, a precision score of 0.66, and a recall score of 0.81. The second experiment using bag-of-words provides accuracy at 70.66%, whereas positive sentiment gives an f1-score of 0.69, a precision score of 0.68, and a recall score of 0.77. Neutral sentiment gives an f1-score of 0.70, a precision score of 0.71, and a recall score of 0.69. Negative sentiment gives an f1-score of 0.75, a precision score of 0.76, and a recall score of 0.73. The third experiment using Word2Vec provides accuracy at 76.18%, whereas positive sentiment gives an f1-score of 0.74, a precision score of 0.69, and a recall score of 0.79. Neutral sentiment gives an f1-score of 0.69, a precision score of 0.70, and a recall score of 0.68. Negative sentiment gives an f1-score of 0.77, a precision score of 0.79, and a recall score of 0.74. Finally, the BERT model provides accuracy at 79.62%, whereas positive sentiment gives an f1-score of 0.78, a precision score of 0.81, and a recall score of 0.76. Neutral sentiment gives an f1-score of 0.65, a precision score of 0.61, and a recall score of 0.69. Negative sentiment gives an f1-score of 0.79, a precision score of 0.77, and a recall score of 0.81.
The first bigram experiment combined with TF-IDF provides accuracy at 75.75%, whereas positive sentiment gives an f1-score of 0.74, a precision score of 0.80, and a recall score of 0.69. Neutral sentiment gives an f1-score of 0.76, a precision score of 0.84, and a recall score of 0.70. Negative sentiment gives an f1-score of 0.73, a precision score of 0.74, and a recall score of 0.72. The second experiment using bag-of-words provides accuracy at 74.04%, whereas positive sentiment gives an f1-score of 0.75, a precision score of 0.76, and a recall score of 0.74. Neutral sentiment gives an f1-score of 0.75, a precision score of 0.77, and a recall score of 0.73. Negative sentiment gives an f1-score of 0.68, a precision score of 0.71, and a recall score of 0.65. The third experiment using Word2Vec provides accuracy at 76.20%, whereas positive sentiment gives an f1-score of 0.76, a precision score of 0.82, and a recall score of 0.71. Neutral sentiment gives an f1-score of 0.72, a precision score of 0.74, and a recall score of 0.71. Negative sentiment gives an f1-score of 0.74, a precision score of 0.81, and a recall score of 0.68. Finally, the BERT model provides accuracy at 80.75%, whereas positive sentiment gives an f1-score of 0.82, a precision score of 0.83, and a recall score of 0.80. Neutral sentiment gives an f1-score of 0.72, a precision score of 0.72, and a recall score of 0.72. Negative sentiment gives an f1-score of 0.84, a precision score of 0.83, and a recall score of 0.85. The results are shown in Table 11.
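A companion sketch of the LSTM classifier, under the same assumptions as the CNN sketch above (Keras, illustrative hyperparameters only):

import tensorflow as tf

VOCAB_SIZE, EMBED_DIM, NUM_CLASSES = 20000, 100, 3  # illustrative values

lstm = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),          # token embeddings
    tf.keras.layers.LSTM(128),                                  # final state summarizes the headline
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),   # positive / neutral / negative
])
lstm.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])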
4.2.2 Second Experiment.
For the second experiment, we used 10-fold cross-validation for each classification model; the results are as follows.
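Relative to the 5-fold sketch shown for the first experiment, only the resampling scheme changes; assuming the same scikit-learn setup, the single difference would be:

from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)  # 10 folds instead of 5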
For the naive Bayes classifier, the first unigram experiment combined with TF-IDF provides accuracy at 66.79%, whereas positive sentiment gives an f1-score of 0.71, a precision score of 0.66, and a recall score of 0.77. Neutral sentiment gives an f1-score of 0.64, a precision score of 0.65, and a recall score of 0.62. Negative sentiment gives an f1-score of 0.63, a precision score of 0.65, and a recall score of 0.62. The second experiment using bag-of-words provides an accuracy of 65.56%, whereas positive sentiment gives an f1-score of 0.69, a precision score of 0.64, and a recall score of 0.74. Neutral sentiment gives an f1-score of 0.62, a precision score of 0.64, and a recall score of 0.60. Negative sentiment gives an f1-score of 0.64, a precision score of 0.70, and a recall score of 0.58. The third experiment using Word2Vec provides accuracy at 78.70%, whereas positive sentiment gives an f1-score of 0.68, a precision score of 0.69, and a recall score of 0.67. Neutral sentiment gives an f1-score of 0.57, a precision score of 0.58, and a recall score of 0.56. Negative sentiment gives an f1-score of 0.60, a precision score of 0.62, and a recall score of 0.58. Finally, the BERT model provides accuracy at 70.72%, whereas positive sentiment gives an f1-score of 0.71, a precision score of 0.71, and a recall score of 0.71. Neutral sentiment gives an f1-score of 0.64, a precision score of 0.64, and a recall score of 0.65. Negative sentiment gives an f1-score of 0.70, a precision score of 0.71, and a recall score of 0.69.
The first bigram experiment combined with TF-IDF provides accuracy at 69.85%, whereas positive sentiment gives an f1-score of 0.77, a precision score of 0.67, and a recall score of 0.92. Neutral sentiment gives an f1-score of 0.50, a precision score of 0.64, and a recall score of 0.41. Negative sentiment gives an f1-score of 0.65, a precision score of 0.89, and a recall score of 0.52. The second experiment using bag-of-words provides an accuracy of 67.21%, whereas positive sentiment gives an f1-score of 0.72, a precision score of 0.68, and a recall score of 0.76. Neutral sentiment gives an f1-score of 0.65, a precision score of 0.66, and a recall score of 0.63. Negative sentiment gives an f1-score of 0.64, a precision score of 0.68, and a recall score of 0.60. The third experiment using Word2Vec provides accuracy at 70.01%, whereas positive sentiment gives an f1-score of 0.68, a precision score of 0.70, and a recall score of 0.65. Neutral sentiment gives an f1-score of 0.68, a precision score of 0.67, and a recall score of 0.70. Negative sentiment gives an f1-score of 0.68, a precision score of 0.66, and a recall score of 0.70. Finally, the BERT model provides accuracy at 71.29%, whereas positive sentiment gives an f1-score of 0.69, a precision score of 0.67, and a recall score of 0.71. Neutral sentiment gives an f1-score of 0.63, a precision score of 0.62, and a recall score of 0.64. Negative sentiment gives an f1-score of 0.72, a precision score of 0.74, and a recall score of 0.70. The results are shown in Table 12.
For the random forest classifier, the first unigram experiment combined with TF-IDF provides accuracy at 61.08%, whereas positive sentiment gives an f1-score of 0.62, a precision score of 0.68, and a recall score of 0.57. Neutral sentiment gives an f1-score of 0.56, a precision score of 0.49, and a recall score of 0.66. Negative sentiment gives an f1-score of 0.48, a precision score of 0.40, and a recall score of 0.59. The second experiment using bag-of-words provides accuracy at 59.53%, whereas positive sentiment gives an f1-score of 0.45, a precision score of 0.51, and a recall score of 0.40. Neutral sentiment gives an f1-score of 0.44, a precision score of 0.34, and a recall score of 0.64. Negative sentiment gives an f1-score of 0.66, a precision score of 0.63, and a recall score of 0.70. The third experiment using Word2Vec provides accuracy at 62.98%, whereas positive sentiment gives an f1-score of 0.60, a precision score of 0.60, and a recall score of 0.60. Neutral sentiment gives an f1-score of 0.51, a precision score of 0.59, and a recall score of 0.45. Negative sentiment gives an f1-score of 0.65, a precision score of 0.62, and a recall score of 0.69. Finally, the BERT model provides accuracy at 64.17%, whereas positive sentiment gives an f1-score of 0.62, a precision score of 0.67, and a recall score of 0.57. Neutral sentiment gives an f1-score of 0.55, a precision score of 0.55, and a recall score of 0.56. Negative sentiment gives an f1-score of 0.59, a precision score of 0.57, and a recall score of 0.68.
The first bigram experiment combined with TF-IDF provides accuracy at 62.69%, whereas positive sentiment gives an f1-score of 0.52, a precision score of 0.41, and a recall score of 0.70. Neutral sentiment gives an f1-score of 0.50, a precision score of 0.53, and a recall score of 0.63. Negative sentiment gives an f1-score of 0.51, a precision score of 0.64, and a recall score of 0.43. The second experiment using bag-of-words provides accuracy at 61.12%, whereas positive sentiment gives an f1-score of 0.66, a precision score of 0.61, and a recall score of 0.73. Neutral sentiment gives an f1-score of 0.57, a precision score of 0.52, and a recall score of 0.62. Negative sentiment gives an f1-score of 0.59, a precision score of 0.78, and a recall score of 0.47. The third experiment using Word2Vec provides accuracy at 63.10%, whereas positive sentiment gives an f1-score of 0.60, a precision score of 0.61, and a recall score of 0.59. Neutral sentiment gives an f1-score of 0.55, a precision score of 0.58, and a recall score of 0.53. Negative sentiment gives an f1-score of 0.59, a precision score of 0.53, and a recall score of 0.67. Finally, the BERT model provides accuracy at 66.26%, whereas positive sentiment gives an f1-score of 0.70, a precision score of 0.69, and a recall score of 0.70. Neutral sentiment gives an f1-score of 0.55, a precision score of 0.50, and a recall score of 0.69. Negative sentiment gives an f1-score of 0.70, a precision score of 0.70, and a recall score of 0.70. The results are shown in Table 13.
For the SVM classifier, the first unigram experiment combined with TF-IDF provides accuracy at 72.60%, whereas positive sentiment gives an f1-score of 0.79, a precision score of 0.93, and a recall score of 0.68. Neutral sentiment gives an f1-score of 0.67, a precision score of 0.69, and a recall score of 0.64. Negative sentiment gives an f1-score of 0.79, a precision score of 0.77, and a recall score of 0.80. The second experiment using bag-of-words provides accuracy at 69.09%, whereas positive sentiment gives an f1-score of 0.50, a precision score of 0.63, and a recall score of 0.42. Neutral sentiment gives an f1-score of 0.33, a precision score of 0.48, and a recall score of 0.25. Negative sentiment gives an f1-score of 0.81, a precision score of 0.73, and a recall score of 0.91. The third experiment using Word2Vec provides accuracy at 77.09%, whereas positive sentiment gives an f1-score of 0.78, a precision score of 0.77, and a recall score of 0.78. Neutral sentiment gives an f1-score of 0.69, a precision score of 0.73, and a recall score of 0.65. Negative sentiment gives an f1-score of 0.70, a precision score of 0.69, and a recall score of 0.71. Finally, the BERT model provides accuracy at 82.97%, whereas positive sentiment gives an f1-score of 0.83, a precision score of 0.83, and a recall score of 0.83. Neutral sentiment gives an f1-score of 0.80, a precision score of 0.80, and a recall score of 0.80. Negative sentiment gives an f1-score of 0.84, a precision score of 0.84, and a recall score of 0.84.
The first bigram experiment combined with TF-IDF provides accuracy at 76.39%, whereas positive sentiment gives an f1-score of 0.83, a precision score of 0.90, and a recall score of 0.77. Neutral sentiment gives an f1-score of 0.86, a precision score of 0.91, and a recall score of 0.80. Negative sentiment gives an f1-score of 0.60, a precision score of 0.61, and a recall score of 0.59. The second experiment using bag-of-words provides an accuracy of 74.26%, whereas positive sentiment gives an f1-score of 0.69, a precision score of 0.61, and a recall score of 0.80. Neutral sentiment gives an f1-score of 0.76, a precision score of 0.93, and a recall score of 0.79. Negative sentiment gives an f1-score of 0.59, a precision score of 0.58, and a recall score of 0.60. The third experiment using Word2Vec provides accuracy at 78.13%, whereas positive sentiment gives an f1-score of 0.78, a precision score of 0.77, and a recall score of 0.78. Neutral sentiment gives an f1-score of 0.71, a precision score of 0.72, and a recall score of 0.70. Negative sentiment gives an f1-score of 0.76, a precision score of 0.76, and a recall score of 0.76. Finally, the BERT model provides accuracy at 83.38%, whereas positive sentiment gives an f1-score of 0.84, a precision score of 0.84, and a recall score of 0.83. Neutral sentiment gives an f1-score of 0.84, a precision score of 0.83, and a recall score of 0.85. Negative sentiment gives an f1-score of 0.84, a precision score of 0.83, and a recall score of 0.84. The results are shown in Table 14.
For the CNN classifier, the first unigram experiment combined with TF-IDF provides accuracy at 72.64%, whereas positive sentiment gives an f1-score of 0.74, a precision score of 0.75, and a recall score of 0.78. Neutral sentiment gives an f1-score of 0.77, a precision score of 0.78, and a recall score of 0.78. Negative sentiment gives an f1-score of 0.65, a precision score of 0.68, and a recall score of 0.62. The second experiment using bag-of-words provides an accuracy of 70.88%, whereas positive sentiment gives an f1-score of 0.71, a precision score of 0.70, and a recall score of 0.71. Neutral sentiment gives an f1-score of 0.69, a precision score of 0.71, and a recall score of 0.68. Negative sentiment gives an f1-score of 0.73, a precision score of 0.75, and a recall score of 0.71. The third experiment using Word2Vec provides accuracy at 75.73%, whereas positive sentiment gives an f1-score of 0.71, a precision score of 0.76, and a recall score of 0.76. Neutral sentiment gives an f1-score of 0.75, a precision score of 0.79, and a recall score of 0.71. Negative sentiment gives an f1-score of 0.81, a precision score of 0.82, and a recall score of 0.80. Finally, the BERT model provides accuracy at 80.09%, whereas positive sentiment gives an f1-score of 0.81, a precision score of 0.80, and a recall score of 0.82. Neutral sentiment gives an f1-score of 0.80, a precision score of 0.79, and a recall score of 0.80. Negative sentiment gives an f1-score of 0.80, a precision score of 0.80, and a recall score of 0.79.
The first bigram experiment combined with TF-IDF provides accuracy at 77.67%, whereas positive sentiment gives an f1-score of 0.75, a precision score of 0.74, and a recall score of 0.76. Neutral sentiment gives an f1-score of 0.72, a precision score of 0.72, and a recall score of 0.72. Negative sentiment gives an f1-score of 0.73, a precision score of 0.69, and a recall score of 0.77. The second experiment using bag-of-words provides accuracy at 75.19%, whereas positive sentiment gives an f1-score of 0.70, a precision score of 0.69, and a recall score of 0.71. Neutral sentiment gives an f1-score of 0.78, a precision score of 0.77, and a recall score of 0.79. Negative sentiment gives an f1-score of 0.74, a precision score of 0.79, and a recall score of 0.70. The third experiment using Word2Vec provides accuracy at 77.20%, whereas positive sentiment gives an f1-score of 0.77, a precision score of 0.77, and a recall score of 0.77. Neutral sentiment gives an f1-score of 0.74, a precision score of 0.77, and a recall score of 0.72. Negative sentiment gives an f1-score of 0.71, a precision score of 0.65, and a recall score of 0.78. Finally, the BERT model provides accuracy at 83.86%, whereas positive sentiment gives an f1-score of 0.84, a precision score of 0.84, and a recall score of 0.84. Neutral sentiment gives an f1-score of 0.82, a precision score of 0.83, and a recall score of 0.82. Negative sentiment gives an f1-score of 0.84, a precision score of 0.79, and a recall score of 0.89. The results are shown in Table 15.
For the LSTM classifier, the first unigram experiment combined with TF-IDF provides accuracy at 76.33%, whereas positive sentiment gives an f1-score of 0.75, a precision score of 0.73, and a recall score of 0.77. Neutral sentiment gives an f1-score of 0.73, a precision score of 0.77, and a recall score of 0.69. Negative sentiment gives an f1-score of 0.71, a precision score of 0.70, and a recall score of 0.72. The second experiment using bag-of-words provides an accuracy of 74.20%, whereas positive sentiment gives an f1-score of 0.72, a precision score of 0.74, and a recall score of 0.71. Neutral sentiment gives an f1-score of 0.73, a precision score of 0.74, and a recall score of 0.72. Negative sentiment gives an f1-score of 0.73, a precision score of 0.76, and a recall score of 0.70. The third experiment using Word2Vec provides accuracy at 79.77%, whereas positive sentiment gives an f1-score of 0.78, a precision score of 0.78, and a recall score of 0.78. Neutral sentiment gives an f1-score of 0.77, a precision score of 0.79, and a recall score of 0.75. Negative sentiment gives an f1-score of 0.80, a precision score of 0.80, and a recall score of 0.79. Finally, the BERT model provides accuracy at 80.35%, whereas positive sentiment gives an f1-score of 0.81, a precision score of 0.81, and a recall score of 0.82. Neutral sentiment gives an f1-score of 0.80, a precision score of 0.77, and a recall score of 0.82. Negative sentiment gives an f1-score of 0.77, a precision score of 0.70, and a recall score of 0.84.
The first bigram experiment combined with TF-IDF provides accuracy at 78.88%, whereas positive sentiment gives an f1-score of 0.76, a precision score of 0.77, and a recall score of 0.75. Neutral sentiment gives an f1-score of 0.75, a precision score of 0.79, and a recall score of 0.70. Negative sentiment gives an f1-score of 0.77, a precision score of 0.80, and a recall score of 0.74. The second experiment using bag-of-words provides an accuracy of 77.28%, whereas positive sentiment gives an f1-score of 0.77, a precision score of 0.79, and a recall score of 0.76. Neutral sentiment gives an f1-score of 0.76, a precision score of 0.77, and a recall score of 0.74. Negative sentiment gives an f1-score of 0.73, a precision score of 0.75, and a recall score of 0.71. The third experiment using Word2Vec provides accuracy at 80.71%, whereas positive sentiment gives an f1-score of 0.83, a precision score of 0.85, and a recall score of 0.81. Neutral sentiment gives an f1-score of 0.80, a precision score of 0.77, and a recall score of 0.82. Negative sentiment gives an f1-score of 0.80, a precision score of 0.80, and a recall score of 0.81. Finally, the BERT model provides accuracy at 84.07%, whereas positive sentiment gives an f1-score of 0.83, a precision score of 0.82, and a recall score of 0.84. Neutral sentiment gives an f1-score of 0.80, a precision score of 0.76, and a recall score of 0.86. Negative sentiment gives an f1-score of 0.84, a precision score of 0.85, and a recall score of 0.83. The results are shown in Table 16.