This section is organized into several subsections. It begins with an overview of the dataset we used, followed by a description of the methodology, including how the dataset was processed; which features, algorithms, and tools we used; and how the experiments were carried out. The final two subsections present the results of the classification and clustering algorithms.
4.2. Methodology
Our goal was to separate, i.e., classify, complex sentences from simple ones. We approached the problem as both a classification and a clustering task. To do this, we extracted certain features from our dataset that could help classify the sentences into one of the two classes.
Figure 1 illustrates each step of our methodology and the connections between them.
As previously mentioned, this paper focuses on the effect of shallow characteristics on low-level readers. For this purpose, we used the TextStat Python library [33], which calculates statistics from text and assists in determining readability, complexity, and grade level. The tool extracts a variety of shallow features and formula scores and is flexible with respect to its input, allowing multiple sentences to be analyzed individually in a single run. We also experimented with the well-known Coh-Metrix web tool [10,34], but at the time of access it proved unsuitable for batch analysis of multiple individual sentences. In addition to Coh-Metrix, we ran some tests with the TAACO [35] and TAASSC [36] feature extraction tools. Both are useful for POS feature extraction and build on CoreNLP, but neither provided a viable solution for batch analysis of individual sentences for the task at hand.
Making use of TextStat, we created two separate Python scripts. The first Python script performs feature extraction. It takes as input the desired dataset (either simple or complex) and exports a correctly formatted CSV file with the feature values of each sentence in the dataset. The second script takes the exported CSV files and merges them while shuffling the complex and simple sentences, creating a CSV with 8001 total rows, with the columns representing the features/characteristics of each sentence, and the final column representing the complexity.
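As a minimal sketch of the second script, assuming the two per-class CSV files are named simple.csv and complex.csv (hypothetical names) and that each row holds the feature values of one sentence, the merge and shuffle step could look as follows:

```python
# Sketch of the second script (file and column names are assumptions).
import pandas as pd

# Each CSV was produced by the feature extraction script, one row per sentence.
simple_df = pd.read_csv("simple.csv")
complex_df = pd.read_csv("complex.csv")

# Add the class column before merging.
simple_df["complexity"] = "simple"
complex_df["complexity"] = "complex"

# Concatenate and shuffle so that the two classes are interleaved.
merged = pd.concat([simple_df, complex_df], ignore_index=True)
merged = merged.sample(frac=1, random_state=42).reset_index(drop=True)

# Export a single CSV: feature columns first, complexity label last.
merged.to_csv("merged_dataset.csv", index=False)
```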
Taking into account the reviewed literature on the most effective features for the target audience, along with the capabilities of the software used, we decided to employ the following features in our lab tests (a brief extraction sketch follows the list):
Flesch reading ease;
Flesch–Kincaid grade level;
Automated Readability Index;
Gunning Fog;
McAlpine EFLAW;
Linsear Write Formula;
Dale–Chall readability score;
Coleman–Liau index;
Spache readability;
Syllable count;
Monosyllable count;
Polysyllable count;
Word count;
Mini-word count;
Difficult words;
Reading time;
School grade.
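For illustration, the sketch below shows how the first script could compute most of the listed features per sentence with TextStat; the function names follow recent TextStat releases and may differ between versions, and the file names are placeholders.

```python
# Sketch of per-sentence feature extraction with TextStat.
import csv
import textstat

FEATURES = {
    "flesch_reading_ease": textstat.flesch_reading_ease,
    "flesch_kincaid_grade": textstat.flesch_kincaid_grade,
    "automated_readability_index": textstat.automated_readability_index,
    "gunning_fog": textstat.gunning_fog,
    "mcalpine_eflaw": textstat.mcalpine_eflaw,
    "linsear_write_formula": textstat.linsear_write_formula,
    "dale_chall": textstat.dale_chall_readability_score,
    "coleman_liau": textstat.coleman_liau_index,
    "spache": textstat.spache_readability,
    "syllable_count": textstat.syllable_count,
    "polysyllable_count": textstat.polysyllabcount,
    "word_count": textstat.lexicon_count,
    "difficult_words": textstat.difficult_words,
    "reading_time": textstat.reading_time,
    "school_grade": textstat.text_standard,  # nominal consensus grade level
}

def extract(sentences_path: str, out_path: str) -> None:
    """Read one sentence per line and write one CSV row of feature values per sentence."""
    with open(sentences_path, encoding="utf-8") as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(FEATURES.keys())
        for sentence in src:
            sentence = sentence.strip()
            if sentence:
                writer.writerow(fn(sentence) for fn in FEATURES.values())

if __name__ == "__main__":
    extract("simple_sentences.txt", "simple.csv")  # hypothetical file names
```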
We further discuss these features in Section 3. Many of them are self-explanatory, since shallow features are simple and consider only surface-level characteristics. A feature unique to the tool, “School Grade”, estimates the grade level needed to understand a sentence based on a consensus of various readability formulas (more information can be found in [24]). To clarify, since some of the above formulas take the number of sentences into account, their values for single sentences may differ significantly from those obtained when assessing whole texts. Even so, they still serve their purpose, since their values simply tend to fall into different ranges per classification group (complex or simple). For the same reason, we could not use the SMOG index: in TextStat, inputs of fewer than 30 sentences are considered statistically invalid because the SMOG formula was normed on 30-sentence samples.
To run our lab tests, we used the Waikato Environment for Knowledge Analysis (WEKA) [12], a collection of machine learning algorithms that also provides tools for data preparation, classification, regression, clustering, and visualization. In our tests, we ran multiple algorithms on our dataset using the extracted features, seeking those that returned the highest percentage of correctly classified instances. In addition, we defined several feature groups to examine whether particular combinations of features, and their number, offered better or worse results. This is a supervised learning setup, and the task is to classify the dataset sentences as either complex or simple. All experiments used 10-fold cross-validation. As a baseline, we ran the ZeroR classifier with all the features, which achieved 50% accuracy.
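For readers who prefer code over the WEKA GUI, a roughly equivalent baseline check can be sketched in Python with scikit-learn (a stand-in for WEKA's ZeroR, not the tool we actually used), assuming the merged CSV described above:

```python
# Sketch of a ZeroR-style baseline with 10-fold cross-validation,
# using scikit-learn's DummyClassifier as a stand-in for WEKA's ZeroR.
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

data = pd.read_csv("merged_dataset.csv")                      # hypothetical file name
X = data.drop(columns=["complexity", "school_grade"])         # numeric features only
y = data["complexity"]

# ZeroR always predicts the majority class; on a balanced 50/50 dataset
# this yields roughly 50% accuracy, matching our baseline.
baseline = DummyClassifier(strategy="most_frequent")
scores = cross_val_score(baseline, X, y, cv=10, scoring="accuracy")
print(f"Baseline accuracy: {scores.mean():.4f}")
```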
4.3. Results for Classification Algorithms
In this subsection, we present the results of our search for the optimal configuration. We ran tests for different feature sets using the following algorithms: Random Forest, Naive Bayes (including the NB-K and NB-D variants), K-nearest neighbors (KNN), and J48.
We ran each of the above algorithms multiple times, modifying the parameters to attain optimal results, and then performed a comparative analysis to determine which was the most efficient. The first test used all the features previously introduced to distinguish between simple and complex sentences.
Table 2 shows the results of our first test. The first column represents the algorithms/classifiers used and their parameters, where CCI stands for correctly classified instances. The top three test results are highlighted in three different shades of blue: the more vibrant shade represents the top result, and the lighter shades represent the next two best results. The worst result is highlighted in red.
The best algorithm in this case appeared to be Random Forest, with a bag size equal to 1 and 10,000 iterations, achieving 60.5625% accuracy; we explain its parameterization later in this paper. It was followed by Naive Bayes with supervised discretization, achieving 59.9875%, and KNN (using the 3501 nearest neighbors, almost half the dataset's size), which achieved the same result. Naive Bayes assumes that each feature is conditionally independent of the others given the class. We attribute this accuracy level to the size of the dataset and to the likelihoods in our feature set being fairly evenly distributed, and our results throughout the tests appear to support this.
J48 also performed well and managed to increase accuracy slightly. By setting the confidence factor for pruning extremely low, to just 0.05, we obtained slightly better results (an increase of almost 2%). Reducing the confidence factor also reduces the depth of the J48 tree through more aggressive pruning. As expected, adding the unpruned option (-U) made things worse, as the wider and deeper tree overfitted. Overfitting occurs when a model fits the training data too closely and fails to generalize. Outliers in our 8 K dataset may also have played a role.
The worst-performing algorithm was KNN with exactly one neighbor, underperforming significantly compared to the others. It was common across all our tests that KNN with fewer neighbors performed worse on this dataset. We conducted further tests to determine the number of neighbors that would give us the highest accuracy. Other than that, it seems that all algorithms performed similarly, which is quite surprising.
Subsequently, we closely examined the obtained feature extraction results to identify the most prominent features. This allowed us to run tests using only the five most useful features for the classification task to see whether performance could be improved. Initially, we measured the mean value of each extracted feature per classification group (complex and simple). For example, complex sentences have an average polysyllable count of 2.88575, while simple sentences have an average of 2.1705. We then measured the difference between the complex and simple means. As another example, the mean Flesch–Kincaid grade level was 8.31635 for simple sentences and 10.195225 for complex sentences, a difference of approximately 1.9.
We then converted these difference values into percentages, as shown in Figure 2, to obtain a reading that can be interpreted across all features. Not all features return values in the same range, so converting to percentages makes comparisons more meaningful. The figure shows the percentage difference in the mean value of each feature between the complex and simple sentences, sorted from top to bottom by that difference.
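A sketch of this computation is shown below; the column names are hypothetical, and the percentage is taken relative to the simple-class mean, which is one possible normalization:

```python
# Sketch: mean feature value per class and the percentage difference between classes.
import pandas as pd

data = pd.read_csv("merged_dataset.csv")               # hypothetical file name
numeric = data.drop(columns=["school_grade"])          # keep numeric features + label
means = numeric.groupby("complexity").mean()           # one row of means per class

complex_mean = means.loc["complex"]
simple_mean = means.loc["simple"]

# Absolute difference of the class means, expressed as a percentage of the
# simple-class mean so that features on different scales become comparable.
pct_diff = ((complex_mean - simple_mean).abs() / simple_mean.abs()) * 100
print(pct_diff.sort_values(ascending=False))
```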
Figure 3 shows, through the WEKA interface, how the feature values differ per classification group. In this figure, the X-axis represents the values that each feature takes, while the (colored) Y-axis represents the classification group. The five features we selected were the polysyllable count, difficult words, the Linsear Write Formula, the Automated Readability Index, and the Flesch–Kincaid grade level. Based on the visualizations and averages, we can clearly observe that complex sentences introduce difficult words with many more syllables compared with simple ones. Since the Automated Readability Index, Flesch–Kincaid grade level, and Linsear Write Formula take into account the number of characters and/or syllables, it is not surprising that their average differences between the simple and complex sets were slightly higher, making them appropriate for this task.
We also examined a Linear Regression model using all the features except the school grade, the only feature that returns a nominal value (it is specific to TextStat). For this, the complexity had to be converted to a numeric value, so we used 1 for complex and 0 for simple sentences. The regression was performed using the greedy attribute selection method and unfortunately returned a 94.3797% relative absolute error.
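A sketch of this check in Python follows (plain least squares as a rough stand-in; WEKA's greedy attribute selection is not replicated), including the relative absolute error that WEKA reports:

```python
# Sketch: linear regression on a 0/1 complexity target and the
# relative absolute error (RAE) in the sense reported by WEKA.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

data = pd.read_csv("merged_dataset.csv")                    # hypothetical file name
X = data.drop(columns=["complexity", "school_grade"])       # exclude the nominal feature
y = (data["complexity"] == "complex").astype(float)         # 1 = complex, 0 = simple

pred = cross_val_predict(LinearRegression(), X, y, cv=10)

# RAE = sum(|y - y_hat|) / sum(|y - mean(y)|), expressed as a percentage.
rae = np.abs(y - pred).sum() / np.abs(y - y.mean()).sum() * 100
print(f"Relative absolute error: {rae:.2f}%")
```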
We ran the same tests as before, this time using the handpicked feature set discussed above, obtaining the results presented in Table 3. The best algorithm in this case was KNN with 551 neighbors, while the variant with 3501 neighbors performed equally well. The lazy approach with small neighborhoods had not worked well up to this point; widening the number of neighbors significantly increased the accuracy, even with this feature set.
The Naive Bayes variants, including NB-K and NB-D, achieved competitive accuracies of 59.675% and 59.7%, respectively. The worst-performing algorithm once again was KNN with a single neighbor. Naive Bayes generally performed well across tests when these specific parameters were applied.
Supervised discretization groups the continuous feature values into bins spanning a range of values rather than treating each distinct value separately. In general, kernel density estimation and supervised discretization were expected to work better given the distribution of our dataset, which benefits from such treatment, whereas plain Naive Bayes assumes that the numerical data follow a normal distribution. In this case, supervised discretization achieved around 0.06% higher accuracy than kernel density estimation; our tests suggest that either technique works well.
Once again, we observed that all algorithms performed similarly when tweaking some simple parameters, with the only ones that significantly underperformed being Random Forest with no parameterization and with a bag size set to 25%, as well as KNN with a single neighbor.
In addition to the previously selected feature set, which consisted of five features handpicked through our custom selection process, we created another feature set consisting of the top five features selected with the information gain (IG) attribute evaluator in WEKA. IG allowed us to rank the importance of the features in our dataset with respect to the class. Information gain measures how much knowing a feature reduces the entropy of the class variable, where entropy quantifies the amount of information (or surprise) inherent in the variable's possible outcomes: lower-probability (more surprising) events carry more information, while higher-probability (less surprising) events carry less. Based on this evaluation, we selected the five features with the highest information gain.
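For illustration, the information gain of a feature can be computed as the entropy of the class minus the conditional entropy of the class given the (discretized) feature; the sketch below uses simple equal-width binning, whereas WEKA's evaluator applies its own discretization.

```python
# Sketch: information gain of each feature with respect to the complexity class.
import numpy as np
import pandas as pd

def entropy(labels: pd.Series) -> float:
    """Shannon entropy H(Y) = -sum p(y) * log2 p(y)."""
    p = labels.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def information_gain(feature: pd.Series, labels: pd.Series, bins: int = 10) -> float:
    """IG(Y; X) = H(Y) - H(Y | X), with X discretized into equal-width bins."""
    binned = pd.cut(feature, bins=bins)
    h_y = entropy(labels)
    h_y_given_x = sum(
        (len(group) / len(labels)) * entropy(group)
        for _, group in labels.groupby(binned, observed=True)
    )
    return h_y - h_y_given_x

data = pd.read_csv("merged_dataset.csv")                 # hypothetical file name
y = data["complexity"]
features = data.drop(columns=["complexity", "school_grade"])
ig = {col: information_gain(features[col], y) for col in features.columns}
print(sorted(ig.items(), key=lambda kv: kv[1], reverse=True)[:5])  # top five features
```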
A notable observation from the IG results, shown in Table 4, is that three of the five features with the highest information gain include less common variables in their formulas rather than simply the number of words or characters: Spache readability includes unfamiliar words, Flesch–Kincaid considers syllables, and Gunning Fog takes complex words into account. This observation may be useful when designing features and formulas for similar datasets.
We ran the same tests on the top five features selected by IG, obtaining the results shown in Table 5. Again, Naive Bayes with supervised discretization achieved the highest accuracy, 59.95%, but the difference from the other algorithms was negligible. All the results were once again very consistent, regardless of the algorithm and feature set used, reaching nearly 60%. Naive Bayes with kernel density estimation and KNN with 3501 neighbors followed, performing less than 0.3% worse. The unpruned version of J48 was again less effective than the others, while KNN with a single neighbor was once again the worst. Overall, the results followed the same pattern as in the previous tests.
It is normal for a lazy algorithm like KNN to perform poorly with very few neighbors, especially when multiple features and a well-mixed set of complex and simple sentences are involved. It was clear that a low number of neighbors would not suffice, since we had 8 K sentences split into two complexity groups. Using the -K flag, we increased the number of neighbors used for the final decision and observed a consistent pattern with this algorithm across all our tests. We ran KNN with different feature sets using 1, 13, 71, 551, and 3501 neighbors (all odd numbers, to avoid ties in the vote). A single neighbor usually did not perform well, whereas increasing the number of neighbors significantly increased accuracy, since a small neighborhood could not cover the variety of example cases provided. In most cases, the sweet spot was a number of neighbors amounting to a substantial fraction of the total number of examples.
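A sketch of this neighbor sweep, using scikit-learn's KNeighborsClassifier as a stand-in for WEKA's IBk:

```python
# Sketch: KNN accuracy as a function of the number of neighbors,
# mirroring the -K sweep we ran in WEKA (IBk).
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

data = pd.read_csv("merged_dataset.csv")                 # hypothetical file name
X = data.drop(columns=["complexity", "school_grade"])
y = data["complexity"]

for k in (1, 13, 71, 551, 3501):                         # odd values avoid tied votes
    knn = KNeighborsClassifier(n_neighbors=k)
    acc = cross_val_score(knn, X, y, cv=10, scoring="accuracy").mean()
    print(f"k={k:>5}: {acc:.4f}")
```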
Examining our results, as shown in Figure 4, we observe that the lines representing test runs with five-feature sets almost overlap and form a similar curve. They are also relatively close to the runs using all the features. The peak performance of these lines lies between the middle and the end of Figure 4. In contrast, tests with a single feature or a pair of features tend to produce a much smoother line; in particular, the line for McAlpine EFLAW as a single feature is almost straight regardless of the number of neighbors.
Considering the literature review, we also ran tests using the Flesch–Kincaid grade level and word count as a feature set, since the literature suggests that this combination can offer high accuracy in certain cases [5]. Evaluating the obtained results, the combination of these two features does indeed appear to be relatively good at assessing readability. The results can be seen in Table 6.
Finally, we examined the predictive power of a special feature selected for this study because it focuses specifically on L2 readers. We ran tests using McAlpine EFLAW as a single feature, which yielded interesting results, as shown in Table 7. Notably, even though shallow features are usually considered unstable or unreliable, McAlpine EFLAW was very consistent throughout the tests, regardless of the classifier and parameterization. Although the accuracy was not as high as in other tests, it consistently stayed around 57%. McAlpine EFLAW is therefore likely a good feature for reliable readability assessment of texts aimed at L2 readers: its predictions in our tests, while not as accurate as those from our other feature sets, were at least very consistent.
In all of our results tables, the Random Forest (RF) algorithm achieved its best results when we increased the number of trees and significantly reduced the bag size. RF considered all available features in every tree. If the number of trees is too small relative to the number of training observations, certain observations contribute to only one tree or to none at all, usually leading to poor results. Increasing the number of iterations (random trees) therefore tends to improve performance, as trees with good predictions come to outnumber those with bad predictions. We also highlight the importance of the bag size percentage: in all of our Random Forest runs, decreasing the bag size yielded better accuracy, and RF usually performed best when the bag size was reduced to 1, meaning that each bag contained only 1% of the training set. This is an extreme setting with unconventional results, but, similar to the extremely low confidence factor in J48, it worked for RF in this case. However, these results might arise from overfitting despite the 10-fold cross-validation, so we ran further tests later on.
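A sketch of this kind of sweep with scikit-learn's RandomForestClassifier, where max_samples roughly plays the role of WEKA's bag size percentage and n_estimators that of the number of iterations:

```python
# Sketch: Random Forest with a small bag size and many trees, roughly mirroring
# WEKA's bagSizePercent (-P) and numIterations (-I) options.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data = pd.read_csv("merged_dataset.csv")                 # hypothetical file name
X = data.drop(columns=["complexity", "school_grade"])
y = data["complexity"]

for bag_fraction, n_trees in [(1.0, 10), (0.5, 1000), (0.01, 10000)]:
    rf = RandomForestClassifier(
        n_estimators=n_trees,
        max_samples=bag_fraction,   # fraction of the training set drawn per tree
        bootstrap=True,
        n_jobs=-1,
        random_state=42,
    )
    acc = cross_val_score(rf, X, y, cv=10, scoring="accuracy").mean()
    print(f"bag={bag_fraction:.2f}, trees={n_trees}: {acc:.4f}")
```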
To verify that this parametrization has a significant impact in most cases, and to learn what affects the performance of the classifier the most, we ran further tests with additional tuning. In the tests presented in Table 8, we used the exact same dataset and varied the bag size, the number of iterations, and the number of features. From Table 8, we can see that increasing the number of iterations (to at least 1000) yielded significantly better results in most cases. Moreover, as in the previous tables, reducing the bag size percentage to as low as 1% yielded the best results in our experimental setup with this dataset. The best-performing setting was a bag size of 1% combined with 10,000 iterations. We thus arrive at an observation that supports our results: increasing the number of iterations improves performance, as does decreasing the bag size. The best run in these tests returned over 60% correctly classified instances, while the worst, with the bag size set to 100% and 10 iterations, returned a below-average 42.4%.
Drawing definitive conclusions is challenging since, based on the above, the fact that increasing the bag size yields worse results (even by about 13% to 17% in some cases) might suggest an overfitting issue. On the other hand, we cannot say this definitively since when setting the bag size to 100% (which is a technique prone to overfitting), we unexpectedly obtain even lower correctly classified instances than when using 50% or 1%.
Through the graphical visualization of the results (Figure 5 and Figure 6), we observe that the lines representing tests using 5 and 17 features follow a similar curve, although the accuracy is lower for the 5-feature set. We also observe a decrease in accuracy when increasing the number of iterations in the runs with a bag size of 50%; however, considering all of our previous tests, we can say that, in general, increasing the number of iterations slightly increases accuracy, as does decreasing the bag size percentage, at least in our setup.
We should also clarify that we could not run tests for 10,000 iterations with bag sizes of 50% and 100% due to hardware and tool limitations. The model building either took too long or WEKA ran out of memory and constantly crashed.
Concluding this part of the research, Figure 7 shows the best and worst percentages of correctly classified instances per feature group in our tests, regardless of the classifier used. Surprisingly, through classifier tuning we achieved almost identical performance across all sets and algorithms, with certain exceptions such as KNN with a single neighbor and Random Forest without parameterization. Using all of the available features produced the best results, while the lowest accuracy was recorded with McAlpine EFLAW as a single feature; however, the difference was so small (≲3%) that we cannot say that EFLAW underperformed. In addition, McAlpine EFLAW was highly stable across all of our tests, with its best and worst performance differing by only 1.97%, while the other feature sets exhibited some fluctuation depending on the algorithm used; the same can be said, to some extent, for the Flesch–Kincaid grade level paired with word count. These fluctuations can be attributed to the number of features used.
4.4. Results for Clustering Algorithms
After completing our main goal of approaching the problem as a classification task, we decided to explore the clustering approach as well. To this end, we again utilized the WEKA environment and algorithms like Expectation Maximization, K-means, and DBSCAN. We used single features and feature sets previously showcased to compare the results of clustering with those from classification.
In these tests, we used WEKA's classes-to-clusters evaluation. In this mode, WEKA initially ignores the class attribute (in our case, complexity) and generates the clusters. During the test phase, it then assigns a class to each cluster based on the majority value of the class attribute within that cluster. Finally, it computes the classification error based on this assignment and produces the corresponding confusion matrix. Performance was measured using the true positives, true negatives, false positives, and false negatives reported by WEKA, from which we calculated the accuracy, precision, recall, and F-measure for the clusters. In the following tables, “Log L.” stands for log likelihood, “S.S.E.” denotes the sum of squared errors, and “CCI” denotes the correctly classified instances. All the tests were run with the number of clusters set to two, except for DBSCAN, where this parametrization was not possible.
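A sketch of this classes-to-clusters evaluation outside WEKA, using GaussianMixture (an EM-based clusterer) as an example; the mapping step assigns each cluster the majority class of its members:

```python
# Sketch: classes-to-clusters evaluation. Clusters are formed without the labels,
# each cluster is mapped to the majority class of its members, and standard
# classification metrics are computed from the resulting assignment.
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("merged_dataset.csv")                 # hypothetical file name
X = StandardScaler().fit_transform(data.drop(columns=["complexity", "school_grade"]))
y = data["complexity"]

# Cluster into two groups, ignoring the class attribute (EM-style clustering).
clusters = GaussianMixture(n_components=2, random_state=42).fit_predict(X)

# Map each cluster id to the majority class among its members.
mapping = {c: y[clusters == c].mode()[0] for c in set(clusters)}
predicted = pd.Series(clusters).map(mapping)

acc = accuracy_score(y, predicted)
prec, rec, f1, _ = precision_recall_fscore_support(
    y, predicted, pos_label="complex", average="binary"
)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} F1={f1:.4f}")
```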
Our first tests used the Expectation-Maximization (EM) algorithm. The recall and F1 values were acceptable compared with the results of the classification tests, and the highest accuracy was equally satisfying, reaching 59.6%, which is comparable to the classification approach. The detailed results can be seen in Table 9 and Table 10.
K-means is one of the most popular and commonly used clustering algorithms. It is quite simple, assigning data points to clusters based on the shortest distance to the chosen centroids; the clusters are then updated at each iteration as points are reassigned to the nearest centroid. In our case, we used two centroids, one for complex and one for simple sentences, since WEKA's parametrization of the algorithm allowed us to set the desired number of clusters. The results were quite encouraging, reaching an accuracy of 58.98% and an F1 score of 0.61 when using the information gain feature set. The detailed results for the K-means algorithm can be seen in Table 11 and Table 12.
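A corresponding K-means sketch with two clusters (scikit-learn's KMeans as a stand-in for WEKA's SimpleKMeans):

```python
# Sketch: K-means with two clusters, one intended for each complexity class.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("merged_dataset.csv")                 # hypothetical file name
X = StandardScaler().fit_transform(data.drop(columns=["complexity", "school_grade"]))

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X)

# The sum of squared errors (WEKA's "S.S.E.") corresponds to KMeans' inertia.
print(f"S.S.E.: {kmeans.inertia_:.2f}")
print(pd.crosstab(data["complexity"], clusters))   # class-to-cluster contingency table
```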
We also ran tests with the DBSCAN algorithm but did not include a results table because, as expected, it was unable to cluster the examples correctly, since they lie extremely close to each other. Due to the density of the examples, DBSCAN created only a single cluster and achieved 50% accuracy. Since our dataset consists of two equally sized classes, one complex and one simple, placing everything in a single cluster was bound to yield 50% accuracy.
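For completeness, a DBSCAN sketch on the same features; the eps value is an assumed placeholder, and with densely packed points essentially everything falls into one cluster:

```python
# Sketch: DBSCAN on the same features. With densely packed examples, most points
# fall into one cluster (label 0), plus possibly a few noise points (-1).
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("merged_dataset.csv")                 # hypothetical file name
X = StandardScaler().fit_transform(data.drop(columns=["complexity", "school_grade"]))

labels = DBSCAN(eps=0.9, min_samples=5).fit_predict(X)   # eps is an assumed value
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"clusters found: {n_clusters}")
```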
In addition to the above tables, we visualized the clusters created (see Figure 8 and Figure 9) to obtain better insights into the situation at hand.
In conclusion, the clustering approach appears to be very much on par with the classification approach in terms of performance. Looking at the results, the larger number of false negatives compared with false positives indicates that the main problem lies in assigning many complex sentences to the simple cluster; the opposite situation is much less common in our tests.