4.2. Evaluation Metrics
We evaluated four metrics, namely precision, recall, F1-measure, and accuracy, defined by Equations (5)–(8), respectively [43]. A true positive (TP) refers to the case in which high-preference content was recommended, and a true negative (TN) refers to the case in which low-preference content was not recommended. Conversely, a false positive (FP) indicates a recommendation of low-preference content, and a false negative (FN) indicates high-preference content that was not recommended. Precision denotes the proportion of recommended (predicted positive) items that are TPs. Recall is the proportion of high-preference items that were correctly recommended, indicating the hit ratio of the recommendation results. The F1-measure is the harmonic mean of precision and recall and, in this study, represents a performance evaluation that considers the trade-off between the two.
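For reference, these four metrics follow directly from the confusion-matrix counts. The following minimal Python sketch mirrors the definitions above; the function and variable names are ours and are not taken from the paper.

def classification_metrics(tp, tn, fp, fn):
    # Precision: proportion of recommended (predicted positive) items that are true positives.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    # Recall: proportion of high-preference items that were actually recommended.
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    # F1-measure: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    # Accuracy: proportion of all recommendation decisions that were correct.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy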
Furthermore, we evaluated novelty, which is the proportion of unknown items, i.e., items that do not exist in the personalized tree, among the total number of items in the recommendation list. Given the recommendation list and the personalized preference of each recommended item, novelty can be calculated using Equation (9).
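As an illustration of this definition (a minimal sketch only; the exact notation and form of Equation (9) are given in the paper), novelty can be computed as the fraction of recommended items that are absent from the personalized tree. The item representation and tree interface below are assumptions made for this example.

def novelty(recommended_items, personalized_tree_items):
    # Novelty: share of recommended items that are unknown to the user,
    # i.e., that do not exist in the personalized preference tree.
    if not recommended_items:
        return 0.0
    unknown = sum(1 for item in recommended_items if item not in personalized_tree_items)
    return unknown / len(recommended_items)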
4.4. Results
We conducted the first experiment by varying the weighting constant that balances breadth and depth to determine whether the proposed preference tree is suitable for the recommendation system.
Table 2 lists the average values of the metrics measured for each value of this constant. In this experiment, the similarity threshold and the proportional constant were both set to 0.5. The two factors affecting the similarity prediction are breadth and depth, which indicate diversity and granularity, respectively. As the weighting constant increased, the recommendation system focused on diversity, and as it decreased, the system focused on granularity. We observed the best performance at a value of 0.5, where the effects of the two factors on the preference prediction were balanced. This observation indicates that the proposed system performs best when the diversity and granularity of the trees are considered at the same rate during preference prediction; that is, the proposed preference-tree model is not biased toward diversity or granularity. This result confirms that the proposed preference-tree model is suitable for predicting user preferences.
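To illustrate the role of this weighting constant, the sketch below combines a breadth-based (diversity) score and a depth-based (granularity) score into a single similarity estimate. The linear combination and the function name are our assumptions for illustration; the paper defines the actual prediction equations.

def weighted_similarity(breadth_score, depth_score, w=0.5):
    # w close to 1.0 emphasizes breadth (diversity); w close to 0.0 emphasizes depth (granularity).
    # w = 0.5 balances the two factors, which gave the best results in Table 2.
    return w * breadth_score + (1.0 - w) * depth_score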
In the second experiment, we evaluated the performance of the similarity tree, which prevents the recommendation system from overfitting the historical data.
Table 3 lists the average metrics for different values of the similarity threshold. During these experiments, the weighting constant and the proportional constant were set to 0.5. When the threshold was 0.4, the similarity tree included preference trees with low relevance; because the threshold was small, personalized trees with low similarity were included in the preference prediction, and a wider range of recommendations could therefore be made. Accordingly, the novelty was highest when the threshold was 0.4. Conversely, when the threshold was 0.7, precision and recall exhibited their best performance because the similarity tree then comprised only highly relevant personalized trees. However, as the personalized tree strongly influences the preference prediction in this setting, the novelty was lowest when the threshold was set to a large value. In subsequent experiments, the threshold was set to 0.5 so that the recommendation system could balance the user preference characteristics and novelty.
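A minimal sketch of how such a threshold can gate which personalized trees are admitted into a similarity group is shown below; the similarity function and data structures are placeholders introduced for illustration and are not the paper's implementation.

def build_similarity_group(target_tree, candidate_trees, similarity, threshold=0.5):
    # A low threshold admits weakly related trees (wider, more novel recommendations);
    # a high threshold keeps only highly relevant trees (higher precision and recall).
    return [tree for tree in candidate_trees if similarity(target_tree, tree) >= threshold]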
In the third experiment, we evaluated the effect of the proportional constant that weights the personalized, federated, and similarity trees on the performance of the recommendation system. The federated and similarity trees predict generalized preferences, whereas the personalized tree predicts personalized preferences.
Table 4 summarizes the average measured values of the five metrics for each value of this constant. As confirmed in the second experiment, the federated and similarity trees mitigated the overfitting problem by reducing the influence of personalized recommendations. Additionally, as the value of the constant decreased, the novelty decreased while the recall increased. Based on these experiments, we confirmed that the recommendation system ensures novelty while maintaining high accuracy when the weighting constant and the similarity threshold are set to 0.5 and the proportional constant is set to 0.6.
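The sketch below shows one possible way such a proportional constant could blend the personalized prediction with the generalized (federated and similarity) predictions. Both the linear form and the direction of the weighting are assumptions inferred from the reported trend (a smaller constant lowered novelty and raised recall); the actual combination rule is defined by the paper's equations.

def combined_preference(personalized, federated, similar, c=0.6):
    # Assumed form: larger c gives more weight to the generalized trees, which improves novelty;
    # smaller c lets the personalized tree dominate, which favors recall.
    generalized = 0.5 * (federated + similar)
    return (1.0 - c) * personalized + c * generalized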
To evaluate the performance of the proposed recommendation system, we conducted comparative experiments using MF-, max-heap-tree-, and knowledge-based recommendation systems. Furthermore, we conducted an ablation study to evaluate the performance of the proposed federated, personalized preference, and similarity trees. The MF- and knowledge-based schemes measure the metrics using the top-N method, with N set to 20.
Table 5 summarizes the performance of the compared recommendation schemes evaluated on the five metrics. Proposed Scheme A exhibited low novelty because it predicts preferences based only on the personalized preference tree. As Proposed Schemes B and C predict preferences based on the federated and similarity trees, respectively, their novelty and recall improved compared with those of Proposed Scheme A. These results indicate that an overfitting problem may occur when the recommendation system considers only the personalized preference tree. Because the max-heap-tree-based system predicts preferences based on the maximum preference score of each node, it has difficulty considering various factors during preference prediction; consequently, it exhibited the lowest performance on all metrics. The MF-based recommendation system achieved better novelty and accuracy than Proposed Scheme A because it predicts preferences based only on generalized preferences. However, both the max-heap-tree- and MF-based methods are unsuitable for real-time recommendations because they recommend items using the top-N method. The knowledge-based scheme predicts preferences based on the TF-IDF similarity vector between items and exhibited the best precision. However, because preferences cannot be predicted for items that do not exist in the TF-IDF vector, we did not measure its novelty. Proposed Scheme D, which predicts preferences based on all three trees, achieved the highest accuracy and novelty among the compared schemes. Based on these experimental results, we confirmed that a recommendation system using the three types of trees can provide recommendations with high accuracy.
Finally, we evaluated the runtimes for updating the proposed tree models and for predicting preferences. The runtimes were measured using the approximately 100,000 historical data points used in the previous experiments.
Figure 7 illustrates the runtime results for Proposed Schemes A, B, C, and D. Proposed Scheme B uses additional federated trees, which slightly increased its runtime compared with that of Proposed Scheme A. Proposed Scheme C exhibited a significantly longer runtime than Proposed Scheme A because it must identify similarity groups and search for LCC nodes in the similarity trees. For Proposed Scheme D, the tree-updating runtime was measured to be 9.11 s, the largest of all schemes, because the federated, personalized, and similarity trees must all be updated. Nevertheless, Proposed Scheme D can be applied in a real-time recommendation system because it can predict user preferences within approximately 0.13 s. Consequently, the proposed recommendation system satisfies real-time requirements and provides better prediction performance than existing systems without additional computations or requirements.