1. Introduction
The rapid advancement of technology has impacted every aspect of human life. Human–computer interaction, particularly with the development of web technologies, has reached new dimensions through information gathering and dissemination centers such as social media, online shopping sites, news, sports, and magazine platforms. This interaction has resulted in the generation of vast amounts of data. With the increasing volume of primarily textual data, including electronic documents, web pages, messages, etc., the organization, retrieval, and processing of these extensive textual data have become significant challenges. In this context, automatic text classification, often referred to as text categorization, stands out as one of the most widely used technologies for addressing these purposes. Text classification is the process of assigning predefined categories or classes to textual data [1]. This process is carried out using machine learning algorithms to analyze the content of textual data and make decisions based on this content. Text classification helps organize, analyze, and comprehend textual data, involving various operations and processes. At the forefront of these processes is term/feature weighting [2,3,4].
Term weighting is defined as the systematic process of increasing the value of a specific term or terms to give them more importance in analysis or computations. In the weighting process, the weight of each term in the documents is calculated using a term weighting algorithm. The primary objective is to highlight the difference between terms that provide distinct and specific information for classification and those that are commonly found across all documents, carrying no specific information. In this way, a more effective and efficient classification process can be achieved. However, the challenges of high dimensionality and sparsity in text data hold a significant place in the term weighting stage. In the literature, representing document contents with multidimensional feature vectors is referred to as the vector space model (VSM) [5]. Considering the terms in text data as features leads to a highly sparse structure in the vector space model, resulting in a high-dimensional term space. One of the most significant research problems in text classification is to represent the relationship between extracted features and documents as effectively as possible in the vector space model, despite this sparsity. In this process, the term weighting operation comes into play, and numerous studies have been presented on this matter. For instance, Debole and Sebastiani [6] initially proposed the idea of supervised term weighting (STW), which involves weighting terms based on the known categorical information in the training data. They introduced three STW schemes: TF-CHI, TF-IG, and TF-GR. These schemes replaced the global IDF factor in TF-IDF with feature selection functions such as the $\chi^2$ statistic (CHI), information gain (IG), and gain ratio (GR). Similarly, Emmanuel et al. [7] noted that the positive contribution of a feature to a category can be obtained by calculating its negative contributions to the other categories and proposed the PIF (Positive Impact Factor) method for term weighting. Ren and Sohrab [8] proposed two new term weighting schemes for text classification, named TF-IDF-ICF and TF-IDF-ICSDF. In addition to the TF-IDF information of terms, these schemes utilize the inverse class frequency (ICF) and inverse class space density frequency (ICSDF) information, respectively, in the weighting process. The authors emphasized that the TF-IDF-ICSDF term weighting scheme has shown promising results, particularly because it provides positive distinctiveness for both frequently and infrequently occurring terms. Moreover, one of the most advanced algorithms in the recently active field of keyphrase extraction is the YAKE! model [9]. The model utilizes the concept of feature-based term weighting to measure the relevance of terms and also considers the position of stop words in the document when extracting expressions.
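To make the vector space representation concrete, the short sketch below builds a sparse document–term matrix with TF-IDF weights using scikit-learn. The three-document corpus is hypothetical and serves only to illustrate the high-dimensional, sparse structure discussed above; it is not drawn from the datasets used later in the paper.

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: three short documents from different topics.
corpus = [
    "the cat sat on the mat",
    "stocks fell sharply on the news",
    "the striker scored twice in the match",
]

vectorizer = TfidfVectorizer()          # unsupervised TF-IDF weighting
X = vectorizer.fit_transform(corpus)    # sparse document-term matrix (the VSM)

print(X.shape)                          # (3 documents, |vocabulary| terms)
print(vectorizer.get_feature_names_out()[:5])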
Despite the successful term weighting schemes proposed in recent years, the ongoing introduction of new schemes in this field suggests that there is room for weighting strategies that better reflect the discrimination potential of terms. Regardless of how recent a proposed term weighting method is, its weighting strategy may still be insufficient in some extreme scenarios, ignoring certain terms or failing to produce reasonable weights for them. Therefore, this paper presents a new weighting model called the Rough Multivariate Weighting Scheme (RMWS) and its mathematical derivative, the Square Root Rough Multivariate Weighting Scheme (SRMWS). This study aims to provide the best possible representation by revealing hidden patterns in text data. For this purpose, the proposed scheme uses rough sets to reveal documents that carry specific information. First, with the help of rough sets, documents containing specific information on a term basis are identified. Then, the distribution of the term across the classes is revealed, and the discrimination power of each term is determined based on these relations. Finally, term weight values are calculated depending on the distinctiveness of the terms. After determining the most distinctive features, classification algorithms are employed to evaluate the performance of the RMWS and SRMWS methods. For this purpose, Support Vector Machines (SVMs), K-Nearest Neighbors (KNNs), and Naive Bayes (NB) classifiers were utilized. Additionally, in the experimental studies, the classification performance of the proposed methods was compared with existing methods from the literature using evaluation criteria such as macro-F1 and micro-F1.
The flow of the rest of this study is as follows: Section 2 offers insights into related works and explores the background of the approaches used for comparison. In Section 3, the preliminaries of rough set theory are provided. Section 4 introduces the RMWS method, and Section 5 details the experimental work along with the obtained results. Finally, after the discussion of the study in Section 6, the study is concluded with Section 7.
2. Related Works
Term weighting processes are crucial tools for enhancing the extraction of meaning and the retrieval of information from text documents. The selection of a term weighting method may vary depending on the purpose of use and the dataset under consideration. Depending on whether the dataset's class information is utilized, supervised or unsupervised methods are employed. Additionally, the preferred method may vary when dealing with a binary or multiclass dataset. Therefore, the literature includes weighting techniques with different working mechanisms. In this section, information about these techniques is provided. Since there are common expressions and preliminaries in the weighting equations of these techniques, detailed information about them is provided in Table 1 (for common expressions) and Table 2 (for preliminaries).
The term frequency (TF) and term frequency–inverse document frequency (TF-IDF) methods, known as traditional term weighting schemes, are unsupervised weighting methods that originate from information retrieval [11]. Frequently occurring words in text data can increase computational costs. For example, the word “the” in English is commonly used and its frequency is high, meaning it is present in nearly all documents. Such words often have low discriminative power in text classification and may, therefore, need to be excluded. To cope with this problem, the TF-IDF method is employed. This method calculates a score by considering the frequency of a word in a particular document and its frequency across all documents. Using these scores, unique words representing important information in a specific document are identified and extracted from the text. Consequently, the IDF (inverse document frequency) value of a rarely occurring word will be high, while the IDF value of a frequently occurring word will be low. Therefore, in the TF-IDF method, by calculating the inverse document frequency values, low scores are assigned to common terms in the text collection and high scores to rare terms. Mathematically, TF-IDF [12] is calculated as follows:
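In its most common textbook form (a standard reconstruction; the exact variant used in Equation (1) may differ slightly, e.g., in IDF smoothing), the weight is:

$w(t_k, d_i) = tf(t_k, d_i) \times \log\dfrac{N}{df(t_k)}$

where $tf(t_k, d_i)$ is the frequency of term $t_k$ in document $d_i$, $N$ is the total number of documents, and $df(t_k)$ is the number of documents containing $t_k$.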
The supervised weighting method TF-PB, which utilizes class-internal and class-external probability distributions, is an effective method for imbalanced datasets and binary classification [11]. This method calculates the weighting as expressed in Equation (2) below.
TF-RF is a weighting method used to measure the importance of terms in a document [13]. This method is a supervised technique, particularly employed in binary classification problems, and it focuses on the frequency of occurrence of terms in the positive and negative categories. Equation (3) below illustrates the TF-RF weighting formula.
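The commonly cited form of this scheme, reproduced here in standard notation as a hedged reconstruction of Equation (3), is:

$w(t_k, d_i) = tf(t_k, d_i) \times \log_2\!\left(2 + \dfrac{a}{\max(1, c)}\right)$

where $a$ is the number of documents of the positive category that contain $t_k$ and $c$ is the number of documents of the negative category that contain it.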
TF-IDF-ICF is a supervised method that utilizes information about the total number of documents and classes in which terms appear. Using this weighting method, the weight value of each term is obtained by multiplying its TF-IDF weight value by its ICF value [8]. Equation (4) represents the mathematical calculation.
The TF-IDF-ICSDF weighting method is a supervised scheme that calculates the weight value of a term by multiplying its TF-IDF weight value by its ICSDF value [8]. The key difference between this method's formula and the previous one is that it considers not only the number of documents of each class in which the term appears but also the ratio of this number to the total number of documents in that class. Equation (5) includes the relevant weighting formula.
TF-TRR is a supervised term weighting method that utilizes the distributions of the positive and negative classes to accurately weight terms for binary classification. In the TF-TRR weighting method, the TF value is used to determine how frequently a term appears in a document [14], while the TRR value is used to assess the relevance of a term to the subject of the document. The mathematical representation is given by Equation (6).
TF-IGM is a recently proposed supervised weighting method for multiclass classification in which terms are weighted using the Inverse Geometric Moment (IGM) [15]. This method computes the IGM by counting, for each class, the number of documents in which the term occurs at least once. These counts are then sorted in descending order, from the largest to the smallest. Equation (7) shows the mathematical formula used to calculate the IGM value of a term. In this formula, the class-based document frequency of the term is used; it indicates the number of text documents containing the term in the r-th category, with these frequencies arranged in descending order. The TF-IGM weight of a term is calculated as shown in Equation (8).
In this formula, λ is an adjustable constant typically defined in studies within the range of 5.0 to 9.0. Moreover, its default value is 7.0.
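For reference, the standard IGM-based formulation from the literature is reproduced below as a hedged reconstruction of Equations (7) and (8), where $f_{k1} \ge f_{k2} \ge \dots \ge f_{km}$ denote the class-based document frequencies of term $t_k$ sorted in descending order over the $m$ classes:

$IGM(t_k) = \dfrac{f_{k1}}{\sum_{r=1}^{m} f_{kr}\, r}, \qquad w(t_k, d_i) = tf(t_k, d_i) \times \bigl(1 + \lambda \cdot IGM(t_k)\bigr)$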
TF-IGMimp attempts to address such weighting issues by incorporating class-size information into the standard IGM formula, as shown in Equation (9); terms are weighted using the improved Inverse Geometric Moment [16]. Specifically, the formula uses the total number of documents of the class in which the term occurs most frequently, together with the number of documents of that class that contain t; the latter quantity corresponds to the leading class-based document frequency used in TF-IGM.
A variety of modifications and enhancements discussed above have been proposed to improve the performance of the TF-IDF scheme [17]. These changes are generally categorized as supervised or unsupervised methods. Beyond these categories, however, there are also term weighting approaches with different working principles, known as vector-based term weighting [18]. Numerous term weighting schemes based on the vector concept have been proposed, and many of these models use n-grams to help algorithms capture the semantics of a document [19,20,21,22,23]. However, in such approaches, as the word tree grows, the term space also expands, resulting in the widespread issue of high dimensionality. This, in turn, requires higher computational power and increases time complexity. Reducing the dimensionality makes the system lighter and more easily usable without excessive computational requirements, whereas consistently working with very high dimensions may not be advantageous, as it can lead to uncertain results. For these reasons, the TF-IDF modification techniques mentioned earlier are often preferred. Indeed, the method proposed in this study is also a supervised, multiclass TF-IDF-style approach.
3. Rough Set Theory
Rough set theory (RST) [24] is a mathematical approach for effective inference from incomplete and inconsistent data, uncovering hidden patterns without requiring additional information such as membership functions. This makes it distinct from methods like Fuzzy Logic and Dempster–Shafer Theory. Widely applied in fields such as data mining, pattern recognition, and text mining, RST independently supports tasks like classification, rule generation, feature selection, and dimension reduction [25,26,27]. RST operates by organizing uncertain data into rough sets and extracting approximate values for concepts. Its foundation lies in classifying relational databases to generate concepts and rules while identifying equivalence relations for further information discovery. Unlike Fuzzy Set Theory, which depends on membership functions with inherent uncertainty, RST uses precise boundary definitions to address uncertain problems. A key concept of RST is the information system, which represents raw data collected from various fields. When the system includes decision attributes, it is termed a decision table; otherwise, it is an information table. Mathematically, let $S = (U, C \cup D)$ represent a decision table or information system, where $U$ denotes the universal set consisting of objects, $C$ denotes the conditional attribute set, and $D$ denotes the decision attribute set. If $D \neq \emptyset$, the system $S$ is referred to as a decision table; otherwise, it is expressed as an information table.
Table 3 provides an example of a decision table.
The indiscernibility (or discernibility) relation determines the similarity or difference between objects in a knowledge system based on a subset of attributes. For any conditional attribute subset $B \subseteq C$, the indiscernibility relation, denoted as $IND(B)$, is defined as follows:

$IND(B) = \{(x, y) \in U \times U : a(x) = a(y) \ \text{for all} \ a \in B\}$

In this formula, the equivalence classes of $IND(B)$ are represented as $[x]_B$.
Rough set theory introduces lower ($\underline{T}X$) and upper ($\overline{T}X$) approximations to analyze a subset $X \subseteq U$ using an attribute subset $T$, as follows:

$\underline{T}X = \{x \in U : [x]_T \subseteq X\}, \qquad \overline{T}X = \{x \in U : [x]_T \cap X \neq \emptyset\}$

These approximations help identify regions within rough sets, distinguishing certain membership ($\underline{T}X$) from probabilistic membership ($\overline{T}X$). For example, considering the decision table shown in Table 3, let $T$ be a subset of the conditional attributes and $X$ a subset of the objects; the pair $(\underline{T}X, \overline{T}X)$ then follows directly from the equivalence classes of $T$.

An accuracy measure of the set $X$ with respect to $T$ is defined as the ratio $|\underline{T}X| \,/\, |\overline{T}X|$. It reflects the determinability of the set $X$ within $T$, ranging from 0 to 1. For the example above, the computed accuracy is less than 1, indicating partial determinability.
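The short sketch below illustrates these notions on a hypothetical decision table: objects are partitioned into equivalence classes by their conditional attribute values, and the lower and upper approximations of a target set, together with the accuracy measure, are derived from those classes. The table, attribute values, and object names are invented for illustration only and do not reproduce Table 3.

# Toy decision table: object -> (conditional attribute values, decision)
U = {
    "x1": (("a", 1), "yes"),
    "x2": (("a", 1), "no"),
    "x3": (("b", 0), "yes"),
    "x4": (("b", 0), "yes"),
}

def equivalence_classes(objects):
    """Partition objects with identical conditional attribute values (IND relation)."""
    classes = {}
    for name, (cond, _decision) in objects.items():
        classes.setdefault(cond, set()).add(name)
    return list(classes.values())

def approximations(objects, target):
    """Lower/upper approximations of a target set of object names."""
    lower, upper = set(), set()
    for eq in equivalence_classes(objects):
        if eq <= target:      # equivalence class fully inside the target set
            lower |= eq
        if eq & target:       # equivalence class overlapping the target set
            upper |= eq
    return lower, upper

target = {name for name, (_cond, decision) in U.items() if decision == "yes"}
lower, upper = approximations(U, target)
print(lower, upper, len(lower) / len(upper))  # accuracy measure |lower| / |upper| = 0.5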
4. Proposed Method: Rough Multivariate Weighting Scheme (RMWS)
In this section, a novel term weighting approach called Rough Multivariate Weighting Scheme (RMWS) is introduced, along with its mathematical derivative, the Square Root Rough Multivariate Weighting Scheme (SRMWS).
Text documents generally represent unstructured datasets. However, to enable processing by a classifier model, unstructured text data need to be transformed into a structured feature space. Creating a system that best represents the content of each document is crucial in this complex task. The vector space model is the most common method used for this purpose. The goal is to ensure that the vector space model effectively represents the dataset. Researchers are exploring various solutions to effectively represent document vectors, which is a significant challenge. In this context, when determining the relationships between the content of documents and terms, assigning appropriate weights to terms is a critical step. Therefore, there is a need for an effective term weighting scheme that assigns reasonable weights to terms based on their classification capabilities. While there are many term weighting schemes in the literature, it is challenging to claim that these schemes ideally reflect the true distinctive abilities of terms. For instance, the word “cat” may be more important in a text related to the “animal” category, while the conjunction “and” may not be as significant. Techniques that consider class information, such as “supervised” term weighting methods, can provide higher classification accuracy. These methods can better capture the importance of terms in different categories by assigning a separate weight for each term in each class. However, it can be argued that supervised weighting methods have not yet fully achieved effective representation of the class–term relationship, and ongoing research in this area indicates the need for further improvement. Therefore, effectively revealing the class–term relationship and identifying documents containing specific information related to this relationship form one of the most important foundations and motivations of this study. Thus, this study aims to provide the best representation by uncovering hidden patterns in text data, using proven rough set methods for successful extraction of hidden patterns from data. With the help of rough sets, documents containing specific information on a term-by-term basis have been identified. In text data, certain documents may contain indicative terms related to a person, field, topic, or object. The presence of these terms provides specific information. In this study, these specific patterns have been obtained using rough sets. The indiscernibility relation in rough sets has been utilized to select documents with distinctive frequencies for specific terms. The indiscernibility relation for documents related to a term is expressed as follows:
In the equation, $D$ represents the document space, where $d_i$ and $d_j$ denote the i-th and j-th documents, respectively. Similarly, T represents the term space, and $t_r$ represents the r-th term. Using the indiscernibility relation, equivalence classes of documents are obtained. These equivalence classes include documents that contain specific terms in the document–term space, providing significant distinctive information. To determine how much information the equivalence classes offer, a subset approximation is applied to them. For a document subset of interest, the corresponding document subset approximation is expressed as follows:
The information value provided by the determined subset depends on its ratio within the class. This ratio is referred to as the Rough Rate (RR) in this study and is formulated as follows:
In Equation (15), the ratio is computed with respect to the set of documents belonging to the corresponding class. After determining, over the equivalence classes, the proportion of documents that provide specific information on a class basis, the distinctiveness of each term is determined. The distinctiveness of a term depends on some fundamental criteria:
A term that is frequently seen in a single class and not observed in other classes is considered distinctive.
A term that appears in some classes is relatively distinctive.
To reveal this information, the distribution of the term among the classes is examined. For this purpose, first, the probability of the term occurring within the class is calculated, and then the probability of the term not occurring within the class is computed.
Some coefficients are needed to fully reveal the distinctiveness of the term. Three constant coefficients were used in this study. Two of them are as follows:
The first coefficient is used to balance the relationship between the two probabilities described above, since their calculated values may differ significantly; it is employed to make their impact on the weight more comprehensible.
The second coefficient handles the case in which the denominator in the calculation may be zero; in such cases, this coefficient is used to enable the calculation.
In accordance with this information, the distinctiveness of a term is calculated using Equation (16) presented in this study.
After calculating the distinctiveness of the term, the weight calculation for the term is performed according to the proposed approach, as follows:
Taking the square root of the TF value in Equation (17) yields the SRMWS value. Accordingly, SRMWS is calculated using Equation (18) below. In Equation (18), the remaining constant represents the third coefficient.
At this point, it is important to clarify that SRMWS is not an entirely new method; rather, it is a variation of the proposed approach, obtained by taking the square root of the term frequency (TF) values. This is analogous to the relationship between the TF-IGMimp [16] method and its derivative, SQRT_TF-IGMimp.
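Schematically, and assuming (as described above) that the RMWS weight combines a term frequency factor with the computed distinctiveness, the relationship between the two schemes can be sketched as follows; this is only an illustration of the square-root modification, not a reproduction of Equations (17) and (18), and it omits the third coefficient:

$w_{\mathrm{RMWS}}(t_k, d_i) = tf(t_k, d_i)\cdot dis(t_k) \;\;\Rightarrow\;\; w_{\mathrm{SRMWS}}(t_k, d_i) = \sqrt{tf(t_k, d_i)}\cdot dis(t_k)$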
An illustrative example:
The working principle of RMWS is demonstrated on the simple document collection given in Table 4.
For this collection of simple documents, the initial calculation involves determining the distinctiveness, or importance, of each term. The RMWS values for each term are computed using Equation (16), and, in this example, the values of all three coefficients (α, β, γ) have been set to 1.
Analyzing the distinctiveness of the terms reveals that the highest distinctiveness belongs to the term that is exclusively present in class c3 and occurs there with a clear majority, which warrants a high score. Conversely, the lowest value is assigned to the term that occurs in equal amounts across all three classes. In addition, two of the terms each appear in two classes and receive distinctiveness values that are only slightly apart; the one that is more concentrated in a single class is the more distinctive of the two and is therefore assigned the higher score. Intuitively, it is clearly evident that the distinctiveness of these terms aligns with the computation performed by RMWS. Table 5 provides a summary of these evaluations.
Now, the weight set for each term can be calculated. For this purpose, the calculated term weights are used to update the value set for the terms, according to Equation (17).
After the weighting of the data is completed, the remaining step is to observe the impact of these weightings on the classifier. Finally, an illustrative representation of the working principle of the proposed RMWS approach is provided in Figure 1.
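As a rough illustration of the grouping step described in this section, the sketch below forms per-term equivalence classes of documents (documents with the same frequency of a term are indiscernible with respect to it) and computes a simple class-level ratio in the spirit of the Rough Rate. It is a hypothetical simplification for intuition only and does not reproduce the actual RMWS equations or coefficients.

from collections import defaultdict

def equivalence_classes_by_term(tf_matrix, term_idx):
    """Group document indices by their frequency of one term: documents with
    the same frequency value are indiscernible with respect to that term."""
    classes = defaultdict(set)
    for doc_idx, row in enumerate(tf_matrix):
        classes[row[term_idx]].add(doc_idx)
    return classes

def rough_rate_sketch(tf_matrix, labels, term_idx, target_class):
    """Illustrative ratio: the share of the target class's documents that fall
    into nonzero-frequency equivalence classes fully contained in that class
    (a lower-approximation-style criterion). Not the paper's exact Equation (15)."""
    class_docs = {i for i, y in enumerate(labels) if y == target_class}
    lower = set()
    for freq, docs in equivalence_classes_by_term(tf_matrix, term_idx).items():
        if freq > 0 and docs <= class_docs:
            lower |= docs
    return len(lower) / len(class_docs) if class_docs else 0.0

# Hypothetical 4-document, 2-term collection with raw term frequencies.
tf = [[2, 0], [1, 0], [0, 3], [0, 1]]
labels = ["c1", "c1", "c2", "c2"]
print(rough_rate_sketch(tf, labels, term_idx=0, target_class="c1"))  # -> 1.0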
5. Experimental Works
In this section, we present the outcomes of our experimental work. Initially, we provide concise details about the employed datasets, followed by a description of the chosen success metrics and the classifiers used. Then, various tests are conducted to determine suitable values for the constant coefficients. Subsequently, we assess the impact of terms weighted by the proposed methods on the efficacy of the classifiers; for this purpose, we compare the performance of the proposed methods with the term weighting methods outlined in the preceding sections, discussing the performance of classifiers trained on terms weighted by each algorithm. Lastly, a set of statistical analyses is included at the end of this section to ascertain whether the performance enhancement achieved by the proposed methods is statistically significant in comparison to the other methods.
5.1. Datasets
Within the scope of the experimental studies, three different reference datasets have been used for text classification: Reuters-21578, 10 Mini Newsgroups, and Enron1. These collections were preferred for their characteristics, such as balanced or unbalanced class distributions and binary or multiclass structures. This diversity was exploited to enable a fair comparison of the term weighting methods.
Reuters-21578 is a collection of documents published on the Reuters newswire in 1987. The documents have been compiled and indexed into categories. The dataset used here includes the first ten classes of the well-known Reuters ModApte split [28], widely used in many text classification studies. This dataset is termed unbalanced because it contains a different number of documents in each class, and it is multiclass. In the context of this study, experiments were conducted on the training and test splits of Reuters-21578. During the feature extraction process, multi-labeled documents were removed from the Reuters-21578 data, and subsequently the two classes named 'wheat' and 'corn' were deleted because they became empty. Further details about the Reuters-21578 dataset are presented in Table 6.
The 20 Newsgroups dataset [28] contains approximately 18,000 newsgroup documents covering 20 different topics and is divided into two subsets: one for training or development and the other for testing or performance evaluation. The division between the training and test sets is based on messages sent before and after a specific date. The 10 Mini Newsgroups dataset used in this study is a small subset of the popular 20 Newsgroups collection, containing ten different classes. This dataset has a balanced structure, meaning the number of documents in each class is equal, and it is multiclass. In the experiments, the dataset was manually divided into training (70%) and test (30%) sections. Detailed information about the 10 Mini Newsgroups dataset is provided in Table 7.
The Enron–Spam dataset, described in the publication 'Spam Filtering with Naive Bayes—Which Naive Bayes?' by V. Metsis, I. Androutsopoulos, and G. Paliouras [29], was collected by these authors. The dataset contains a total of 17,171 spam and 16,545 non-spam email messages (33,716 emails in total). In this study, a subset of the Enron–Spam dataset, named Enron1, was used. This dataset is imbalanced, as it contains a different number of documents in each class, and it is used for binary classification, as it consists of only two classes. Content information related to Enron1 is presented in Table 8.
5.2. Assessment of Performance
This study employed micro-F1 and macro-F1 scores as the key performance metrics to assess the efficacy of the term weighting methods. The F1 score, incorporating both precision and recall, was utilized in the evaluation. In the macro-averaging approach, the F1 score is calculated individually for each class, and subsequently the mean across all classes is taken [30]. The computation of the macro-F1 score is illustrated below in Equation (19), in which $p_j$ and $r_j$ represent the precision and recall scores of class $j$, respectively.
Conversely, the F1 score is computed in micro-averaging without considering class-specific information. Therefore, all classification decisions are taken into account across all corpora. In the evaluation of imbalanced datasets, the micro-averaging approach may result in the dominance of large classes over small ones. However, this scenario might not be applicable to balanced datasets, where the number of documents in each class is equal, and the feature counts are similar. The calculation of the micro-F1 score is illustrated below in Equation (20).
where p and r represent the precision and recall values computed across all classes. The micro-F1 score, influenced by the prevalence of larger classes with more documents, might not ensure a fair assessment in all scenarios. Consequently, to achieve a more unbiased evaluation, the micro-F1 score is preferred for balanced datasets, whereas the macro-F1 score is employed for imbalanced datasets. In this study, a diverse range of datasets, including both balanced and imbalanced ones, was employed. Thus, the experiments utilized both the micro-F1 and macro-F1 criteria to ensure a comprehensive evaluation.
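As a hedged reconstruction of the standard definitions that Equations (19) and (20) are expected to correspond to, with $p_j$, $r_j$ the per-class precision and recall and $TP_j$, $FP_j$, $FN_j$ the per-class true positives, false positives, and false negatives:

$\text{macro-}F_1 = \dfrac{1}{|C|}\sum_{j=1}^{|C|}\dfrac{2\,p_j\,r_j}{p_j + r_j}, \qquad \text{micro-}F_1 = \dfrac{2\,p\,r}{p + r}, \quad p = \dfrac{\sum_j TP_j}{\sum_j (TP_j + FP_j)}, \;\; r = \dfrac{\sum_j TP_j}{\sum_j (TP_j + FN_j)}$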
5.3. Classifiers
The proposed method, RMWS, is not dependent on the learning model, as it is a term weighting technique. Therefore, to explore the impact of the features incorporated into the final feature set on classification accuracy, three distinct classifiers were utilized in the experimental phase of the study. Concise explanations of the classifiers employed are outlined in Table 9.
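As an illustration of how the weighted term vectors could be passed to the classifier families named above, the sketch below uses scikit-learn implementations of SVM, KNN, and NB and scores them with micro- and macro-F1. Variable names such as X_train and y_train are placeholders for pre-weighted document–term matrices and labels, and the specific hyperparameters are assumptions rather than the paper's settings.

from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

classifiers = {
    "SVM": LinearSVC(),                          # linear SVM, common for sparse text vectors
    "KNN": KNeighborsClassifier(n_neighbors=5),  # distance-based classifier
    "NB": MultinomialNB(),                       # requires nonnegative term weights
}

def evaluate(X_train, y_train, X_test, y_test):
    """Fit each classifier on weighted term vectors and report micro/macro F1."""
    scores = {}
    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        predictions = clf.predict(X_test)
        scores[name] = {
            "micro-F1": f1_score(y_test, predictions, average="micro"),
            "macro-F1": f1_score(y_test, predictions, average="macro"),
        }
    return scores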
5.4. Coefficient Analysis
This study utilizes three adjustable coefficients, akin to a potentiometer, to achieve specific balances. To set these coefficients to appropriate values and to observe their impact on the results, micro-F1 and macro-F1 scores for the SVM, KNN, and NB classifiers were obtained on the Reuters-21578 dataset with term sizes of 500, 1000, and 2000. The results obtained are presented in Table 10, Table 11 and Table 12. Furthermore, the best-performing results in these tables are emphasized in bold font to help the reader focus on the most significant outcomes.
When examining the tables, Table 10 has been created to analyze the effect of the α coefficient. In this table, the other two coefficients are kept constant while the α value is varied, being tested for several values greater than and less than 1. Accordingly, the α coefficient has a negligible impact on the KNN classifier, whereas it exerts a significant influence on the NB classifier; for the SVM classifier, its effect is relatively modest. The performance of the NB classifier is notably enhanced when the value of this coefficient exceeds 1. For instance, with 500 terms, the macro-F1 value rises from 57.5188% at a low α value to 87.7820% at a higher one. This implies a positive enhancement in the performance of the NB classifier when the α coefficient is greater than 1. However, a careful examination of Table 10 reveals a decline in results when this value is 3. Therefore, the α value needs to lie within a certain range; the results show that a value greater than 1 but below 3 is reasonable.
Table 11 was created to analyze the second coefficient. In this table, the other coefficients are fixed while the analyzed coefficient takes several values. For this dataset, this coefficient generally appears to influence the results positively when its value is high. It is a constant used within the method to adjust its contribution to uncovering hidden patterns within the dataset; a high value signifies the importance of rough clustering in revealing specific information. In this analysis, the value was set to a maximum of 2.5, and optimal results were obtained for all classifiers at this value, which indicates that values slightly higher than 2.5, and in proportion to it, may further improve the results.
Table 12 was created to show the effect of the third coefficient, for which several random values less than 1 were tested. This coefficient is a constant introduced to facilitate the calculation in cases where the denominator in the method equation is zero; therefore, values less than 1 have been assigned. Its effect is observed to vary on a class-by-class basis.
As a result, the coefficient values predicted to be ideal for each classifier are provided in Table 13 below. These fixed values were used in the experimental section of this study.
When the tables created for the coefficient analyses and Table 13 are examined together, the relationship between the coefficients can be analyzed on a classifier basis. To perform this analysis, the pairwise interaction effects of the coefficients were examined for each classifier. Accordingly, the pairwise coefficient relationship graphs for each classifier are given, in order, in Figure 2, Figure 3 and Figure 4.
In order to see whether the values given in Table 13 would affect the results as stated, further tests were performed. In these tests, the coefficient values given for each classifier in Table 13 were compared with the best results for the same classifier in Table 10, Table 11 and Table 12. The results obtained are given in Table 14, with the best results highlighted in bold.
Table 14 shows that the proposed methods give better results at the coefficient values determined on a classifier basis. This situation indicates both the necessity and importance of the analyses performed.
Note: Similar procedures have been applied to the other datasets in this study, and the determined coefficient values have yielded comparable results on these datasets as well. Therefore, the values in Table 13 have been used for all datasets and accepted as the default values for the proposed methods.
5.5. Accuracy Analysis
This section presents a comprehensive comparison of the proposed term weighting methods with established approaches, namely TF-IDF, ICF, ICSDF, TRR, IGM, SIGM, IGMimp, and SIGMimp. The performance evaluation is conducted on various term dimensions: 750, 1500, 2500, 3750, 4500, 5750, 6500, and 7750. The Distinguishing Feature Selector (DFS) [31] algorithm is utilized for term selection within the dimension selection process. Term selection methods are frequently used to address the high dimensionality of term spaces and to assess the effectiveness of proposed techniques in text classification; the DFS approach is preferred in this study because it provides effective results. This study investigates the performance of the SVM, KNN, and NB classifiers in terms of the macro-F1 and micro-F1 criteria on the weighted term dimensions. The obtained results are presented in separate graphs for each dataset.
Figure 5 presents a comparative analysis of RMWS, SRMWS, and the eight additional approaches from the literature, based on the micro-F1 and macro-F1 criteria, using the SVM classifier on the Reuters-21578 dataset. As observed in the figure, the SRMWS method outperforms all other methods in terms of both micro-F1 and macro-F1 scores across all term dimensions. Analogously to the relationship between RMWS and SRMWS, SIGM and SIGMimp represent the square root variants of the IGM and IGMimp approaches, respectively. Another key observation from Figure 5 is that RMWS emerges as the second-best approach after SIGM and SIGMimp; notably, RMWS even surpasses these approaches in the 4500-dimensional setting according to the micro-F1 criterion. RMWS has demonstrated a remarkable superiority over the non-variant methods, showcasing its potential in the field of term weighting. These findings can contribute to the development of text classification algorithms and lay a foundation for future research. They also demonstrate that the proposed term weighting model offers a more effective weighting strategy than the existing models in the literature, with the weight scores assigned to terms serving as a measure of their discriminative power.
Figure 6 presents the results obtained using the KNN classifier. It can be observed that the SRMWS method achieves the same result as SIGM and SIGMimp for micro-F1 at the 750th dimension, while outperforming them in all other dimensions. Similarly, for macro-F1, it achieves the best result in all dimensions, except for the 7750th dimension, where it is equal to SIGM and SIGMimp. Among the non-variant methods, RMWS shows the highest performance for macro-F1. For micro-F1, it shows the highest performance in all dimensions except for 3750 and 4500.
Figure 7 presents the performance of term weighting algorithms for the NB classifier. It can be observed that the RMWS and SRMWS methods yield the same results across all term sizes and demonstrate superior performance compared to other approaches. Additionally, it is noted that algorithms derived from the square root of TF values (variants) also produce similar results to their parent algorithms.
Figure 8 shows that the SRMWS method outperforms all other methods in all dimensions for both micro-F1 and macro-F1 scores when classifying text documents with the SVM classifier. This suggests that the term weights assigned by SRMWS are more effective and discriminative. Furthermore, RMWS closely follows SRMWS in terms of performance, demonstrating the potential benefits of incorporating variants in the term weighting process for SVM classification.
Figure 9 shows that the KNN classifier achieves the best micro-F1 and macro-F1 results with SRMWS at 1500 and 2500 dimensions. At 5750 dimensions, SRMWS performs as well as SIGM. These findings suggest that SRMWS is a flexible option for dimension selection and can perform well with the KNN classifier.
Figure 10 presents a noteworthy finding regarding the NB classifier. All approaches, except for the TF-IDF method, yielded identical results in all dimensions for both micro-F1 and macro-F1 criteria. The binary class structure of the dataset plays a pivotal role in obtaining these results. In binary class datasets, the performance of different term weighting methods may exhibit minimal variations. This stems from the fact that only a few key terms are sufficient to discriminate between the two classes. In this case, frequency-based methods such as TF-IDF cannot provide a significant advantage compared to other methods.
Figure 11 presents a comparison of the performance of the various weighting approaches when employing the SVM classifier. Excluding dimensions 1500 and 2500, the SIGM, SIGMimp, and SRMWS schemes attain the highest values in both the micro-F1 and macro-F1 metrics; after these approaches, RMWS yields the best results in all dimensions except the 750th. The results obtained with the KNN classifier are presented in Figure 12. As can be observed, SRMWS clearly achieves the highest values on both criteria in all dimensions, and RMWS achieves the next-best KNN results in all dimensions except the 750th. The best NB results are obtained with RMWS, which achieves the highest micro-F1 and macro-F1 scores in every dimension except 750; in the 750th dimension, the highest score is obtained with SRMWS. These cases are illustrated in Figure 13. In summary, for this dataset, the results regarding the term weighting schemes are as follows: (a) in the SVM and KNN classifiers, the SRMWS method stands out as the most effective approach; (b) the SIGM, SIGMimp, and SRMWS methods also exhibit high performance with the SVM classifier; and (c) for the NB classifier, the RMWS method yields the optimal outcomes.
As a result, in this study, two novel methods, RMWS and SRMWS, have been proposed and compared with existing approaches on imbalanced-multiclass, imbalanced-binary class, and balanced-multiclass datasets. The obtained results clearly demonstrate that RMWS and SRMWS outperform the existing approaches on all three types of dataset, consistently exhibiting superior performance without being significantly affected by class imbalance or the number of classes. Furthermore, it was determined that the SRMWS method outperforms the RMWS method on balanced-multiclass datasets.
5.6. Statistical Analysis
This section presents a comprehensive statistical analysis to assess whether the RMWS and SRMWS approaches yield meaningful results. The initial test examines the average performance of the term weighting approaches across all datasets, specifically focusing on results obtained with the same term dimension. This allows us to assess whether an approach delivers consistent results regardless of the specific dataset. Moreover, certain approaches may perform well on specific datasets but exhibit poor performance on others, indicating a lack of consistency. While an approach is not expected to yield the best results on every dataset, it should not produce excessively poor results either; consistency is crucial for evaluating an approach's reliability and overall performance. To facilitate this analysis, Table 15, Table 16 and Table 17 have been devised for the SVM, KNN, and NB classifiers, respectively. Furthermore, the notation 'fs' within the tables denotes the term dimension, and the best results are highlighted in bold.
A close examination of Table 15 and Table 16 reveals that SRMWS clearly produces the most successful results. This finding indicates that the SVM and KNN classifiers exhibit more effective performance when coupled with SRMWS, rendering this method preferable for these classifiers. Moreover, SRMWS consistently outperforms the other methods across all datasets and term dimensions, establishing a statistically significant difference.
An examination of Table 17 reveals that RMWS is the most successful method when used with the NB classifier. This finding indicates that the NB classifier exhibits more effective performance when coupled with RMWS, rendering this method preferable for this classifier. While other methods achieve results similar to RMWS in some term dimensions, RMWS generally emerges as the best-performing method. This demonstrates that RMWS consistently outperforms other methods when used with the NB classifier, establishing a statistically significant difference.
A t-test was also used as a statistical analysis to demonstrate the validity of the best-performing proposed methods, RMWS and SRMWS. For this purpose, Table 18 is constructed for RMWS and Table 19 for SRMWS. The tables report the p-values obtained from one-sided, paired t-tests. If the p-value is below 0.05, the obtained results are deemed statistically significant; in particular, a p-value below 0.05 signifies statistical significance at a confidence level of 95%, and a p-value below 0.01 signifies statistical significance at an even higher confidence level of 99%.
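A minimal sketch of such a test is given below, assuming paired scores of a proposed scheme and a baseline over matched settings (same dataset, classifier, and term dimension); the score lists are placeholders, not values from the tables.

from scipy import stats

proposed = [0.91, 0.88, 0.93, 0.90, 0.89]   # placeholder macro-F1 scores of a proposed scheme
baseline = [0.89, 0.86, 0.92, 0.88, 0.87]   # placeholder macro-F1 scores of a baseline scheme

t_stat, p_two_sided = stats.ttest_rel(proposed, baseline)   # paired t-test
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2

print(f"t = {t_stat:.3f}, one-sided p = {p_one_sided:.4f}")
# p < 0.05 -> significant at 95% confidence; p < 0.01 -> significant at 99%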
The results demonstrate that the performance gains achieved with the proposed RMWS weighting scheme compared to the other schemes are statistically significant, with a very high confidence level of 99% for NB. This confidence level is also 99% for all but one p-value for the SVM classifier. As seen in Table 18, for the KNN classifier, almost all p-values reach the 95% or 99% confidence level, with only a few exceptions. When Table 19 is examined, it is observed that SRMWS provides results with a very high confidence level of 99% for all classifiers. The obtained p-values indicate that the RMWS and SRMWS schemes perform significantly better than the other schemes, implying that the possibility of random coincidence is low and the findings are reliable.
As a result, these findings clearly validate the superiority of the proposed RMWS and SRMWS weighting schemes compared to other schemes. Both schemes provide a statistically significant performance increase when used with NB, SVM, and KNN classifiers.
6. Discussion
The findings from our study indicate that the proposed methods, RMWS and SRMWS, significantly enhance the performance of text classification tasks. By exploring the class–term relationship using rough sets, these methods provide a novel approach to term weighting that outperforms both traditional and contemporary methods. The RMWS approach utilizes rough sets to identify terms that offer specific and distinctive information relevant to different classes. This capability allows for the identification of patterns within documents that may otherwise be overlooked by other term weighting schemes. By incorporating the coefficients α, β, and γ, RMWS can adjust the influence of these terms more accurately, leading to superior classification results. The SRMWS further refines this process by taking the square root of RMWS values, making the term weights more balanced and discriminative.
Experimental results revealed that RMWS and SRMWS consistently outperform existing term weighting schemes, regardless of dataset structure. This includes imbalanced-multiclass, balanced-multiclass, and imbalanced-binary class datasets, demonstrating the robustness of the proposed methods. Notably, SRMWS showed the highest classification performance across most scenarios, suggesting that the square root transformation adds significant value in enhancing term distinctiveness.
One critical finding is the statistical significance of the performance improvements offered by RMWS and SRMWS. The p-values from our t-tests indicate that the differences in performance are not due to random chance. This statistical validation underscores the efficacy of the proposed methods and their potential utility in practical applications.
When comparing classifiers, the NB classifier showed the greatest improvement with RMWS, while the SVM and KNN classifiers benefited more from SRMWS. This suggests that the choice between RMWS and SRMWS may depend on the specific classifier used and the nature of the dataset. Future studies could explore this relationship further, examining how different classifiers interact with these term weighting schemes under various conditions.
This study’s results also highlight the importance of proper coefficient selection for optimizing the performance of RMWS and SRMWS. The α, β, and γ coefficients have been shown to significantly impact classification outcomes, necessitating careful calibration based on the dataset and classifier used. This aspect presents an opportunity for future research to develop automated techniques for coefficient tuning, potentially using machine learning algorithms.
Overall, the incorporation of rough set theory into term weighting presents a promising direction for improving text classification. By focusing on revealing hidden patterns and specific class–term relationships, RMWS and SRMWS offer a more nuanced and effective approach to term weighting. The findings from this study contribute to the field by presenting robust, statistically validated methods that outperform existing approaches, paving the way for future advancements in automated text classification.