1 Introduction
A central challenge across various computational disciplines is the construction of algorithms that automatically improve with experience [32]. An important class of such algorithms is concerned with supervised classification, in which an algorithm is presented with a training set of objects, each with a known vector of descriptive measurements (features) and a known class membership. The aim is to use this training set to create a classification method (also known as a classifier, classification model, classification algorithm, or classification rule) that can classify future objects based solely on their descriptive feature vectors. In what follows, for convenience, we will assume just two classes of objects (binary classification), labelled 0 and 1 (commonly named the negative and positive class, respectively).
A general high-level strategy for constructing classification methods is to begin by assigning scores to the objects to be classified. In particular, with two-class problems, we might score the objects by their estimated probability of belonging to class 1, say. Given such classification scores, the appropriate way to proceed will depend on the problem. If the set of objects to be classified is known and finite (for example, if we wished to classify a given set of patients as having cancer or not), then all objects can be ranked and the top x objects, or the top y%, assigned to class 1 (having cancer). However, if the set of objects is not fully known, or is of arbitrary and unknown size (for example, classifying future patients as having cancer or not), then it is not possible to rank them all and choose the ones with the highest scores.
In such a case, a classification threshold must be chosen, so that those with scores larger than the threshold are assigned to class 1. This threshold could be chosen so as to classify a given percentage to class 1 (based on a known or estimated distribution of scores). Alternatively, it could be chosen in absolute terms: all objects with estimated class 1 probability greater than a threshold of 0.9 could be assigned to class 1, for example. Note an important distinction between these two cases: if the top-scoring y% are to be assigned to class 1, then whether or not a particular object is assigned to class 1 depends on the scores of other objects. In contrast, if an absolute threshold is used, then objects can be assigned independently of the scores of other objects.
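To make the distinction concrete, the following short Python sketch (entirely synthetic scores; the function and variable names are ours, not taken from any particular library) contrasts the two decision rules: assigning the top-scoring y% of objects to class 1 versus applying an absolute threshold.

```python
# Illustrative sketch (synthetic scores): top-y% assignment versus an absolute threshold.

scores = [0.95, 0.91, 0.40, 0.72, 0.15, 0.88, 0.55, 0.30, 0.67, 0.05]

# Rule 1: assign the top 30% of objects (by score) to class 1.
# Whether an object is assigned depends on the scores of the other objects.
y_percent = 30
k = round(len(scores) * y_percent / 100)
cutoff = sorted(scores, reverse=True)[k - 1]
top_y_assignment = [1 if s >= cutoff else 0 for s in scores]

# Rule 2: assign all objects with a score above an absolute threshold to class 1.
# Each object is assigned independently of the scores of the other objects.
t = 0.9
threshold_assignment = [1 if s > t else 0 for s in scores]

print(top_y_assignment)       # depends on the whole score distribution
print(threshold_assignment)   # depends only on each individual score
```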
Central to the challenge of creating classification methods is the ability to evaluate them. This is needed so that one can decide if a method is good enough for some purpose, and to choose between competing methods (which possibly have been created using different algorithms, or the same algorithm but different parameter settings). “Choosing between” methods includes notions of algorithm selection, parameter estimation, choice of descriptive features, transformations of features, and so on, as these implicitly select between alternative methods.
Criteria for evaluating classification methods and algorithms are of two kinds: problem-based and “accuracy”-based. Problem-based measures are specific to particular applications, and include aspects such as speed of classification (important, for example, for data streams), speed of adaptation and updating (for example, in spam and fraud detection), ability to handle incomplete data, interpretability (mandatory in certain legal frameworks), how easy they are for a non-expert to use, how effective they are in the hands of an expert, and so on.
“Accuracy”-based measures, on the other hand, are concerned with the accuracy with which objects are assigned to the correct class. In two-class problems, there are two ways in which such assignments may be in error: class 0 objects may be incorrectly assigned to class 1 (known as false positives or Type I errors), and class 1 objects may be incorrectly assigned to class 0 (false negatives or Type II errors). To produce a one-dimensional performance measure, which can be used to compare classification methods, the extent of these two types of error must be aggregated into a single value. This can be done in various ways—hence the quotation marks around “accuracy” above when describing this class of criteria: accuracy may mean different things. Various reviews which look at different aspects of classification performance measures have been written, including [11, 15, 19, 23, 27, 50, 57], and others.
In this article, we review and examine one particular measure in detail: the F-measure, which is generally calculated as the harmonic mean of precision and recall. The F-measure was originally proposed in the domain of information retrieval [59] to evaluate the quality of ranked documents retrieved by a search engine. In this discipline, only one class is of real interest (the relevant documents; without loss of generality, call it the positive class), and we are concerned with the proportion of that class misclassified and the proportion classified into this class which are, in fact, from the other class. These two aspects are captured by recall and precision, respectively. In particular, we are not concerned with the number or proportion correctly classified from the class of no interest. The F-measure is then a way of combining recall and precision, to yield a single number through which classification methods can be assessed and compared.
The appropriateness of the F-measure is illustrated by considering very imbalanced situations [33] in which the uninteresting class is very large (such as all individuals who do not have cancer). In such cases, measures such as error rate will be heavily influenced by a large number of irrelevant correct classifications to the uninteresting class, and therefore can be poor indicators of the usefulness of the classification method used [7, 46, 51, 52]. However, for more general classification problems, misclassifications (and correct classifications) of both classes are of concern. This means that the F-measure is an inappropriate measure of performance for these situations, as we discuss further in Section 4.1.
Apart from the pragmatic benefit of reducing the two measures, recall and precision, to a single number, there does not seem to be a proper justification for why the F-measure is appropriate for evaluating general supervised classification methods (where both classes are of interest). Many publications provide no reason for evaluating classification methods with the F-measure, or offer rationales such as “because it is a commonly used measure in our research discipline” or “because it has been used by previous relevant work”. Others have previously discussed problematic issues with the F-measure [42, 51, 60].
Contributions and outline: In the following section, we trace the use of the F-measure back to its original development in information retrieval, and its increasing use over time in diverse computational disciplines. In Section 3, we then describe the properties of the F-measure, including its generalisation, the \(F_{\beta }\) measure. This discussion shows that the F-measure has characteristics that can be regarded as conceptual weaknesses. We discuss these and the resulting criticism of the F-measure in Section 4. In Section 5, we then describe alternatives to the F-measure, and we conclude our work in Section 6 with a discussion and recommendations on how to use the F-measure in an appropriate way.
We have previously explored the shortcomings of the F-measure in the context of record linkage [29], the task of identifying records that refer to the same entities across databases [5, 13]. More recently, we proposed a variant of the F-measure that overcomes some of its weaknesses [30]. Here, we go further and provide a broader discussion of the use of the F-measure and its application in general classification problems. Our work is targeted at a broad audience and aims to provide the reader with a deeper understanding of the F-measure. Our objective is also to highlight that using a certain performance measure simply because it is commonly used in a research community might not lead to an appropriate evaluation of classification methods.
3 The F-Measure and Its Properties
Let \(\mathbf {x} = (x_1, \ldots , x_m)\) be the vector of m descriptive characteristics (features) of an object to be classified. From this, a classification method will compute a score \(s(\mathbf {x})\) and the object will be assigned to class 0 or class 1 as follows:
— If \(s(\mathbf {x}) \gt t\), assign the object to class 1.
— If \(s(\mathbf {x}) \le t\), assign the object to class 0.
Here t is the “classification threshold”. Clearly, by changing t we can shift the proportion of objects assigned to classes 0 and 1. The threshold is thus a control parameter of the classification method. The method’s performance is then assessed by applying it to a test dataset of objects with known classifications. Here, we shall assume that the test set is independent of the training set—the reasons for this and the problems arising when it is not true are well-known and have been explored in great depth (see, for example, Hastie et al. [32]).
Application of the classification method to the test set leads to a (mis)classification table or confusion matrix, as illustrated in Figure 2. Here, FN is the number of test set objects which belong to class 1 but which the classification method assigns to class 0 (that is, which yield a score less than or equal to t), and so on, so that the off-diagonal counts, FP and FN, give the number of test set objects which have been misclassified. This means that \((FP + FN)/n\), with \(n = TP + FP + FN + TN\) the total number of objects in the test set, gives an estimate of the overall misclassification or error rate [27], another widely used measure of classification performance.
In what follows, we shall regard class 1 objects as “cases” of relevance or interest (such as exemplars of people with the disease we are trying to detect, of fraudulent credit card transactions, and so on). In terms of this table, recall, R, is defined as the proportion of true class 1 objects which are correctly assigned to class 1, and precision, P, is defined as the proportion of those objects assigned to class 1 which really come from class 1. That is
— Recall \(R = TP/(TP + FN)\),
— Precision \(P = TP/(TP + FP)\).
The F-measure then combines these two using their harmonic mean, to yield a univariate (single number) performance measure:
\[ F = \frac{1}{\frac{1}{2}\left(\frac{1}{P} + \frac{1}{R}\right)} = \frac{2PR}{P + R} = \frac{2\,TP}{2\,TP + FP + FN}. \qquad (3) \]
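As a minimal illustration (the function name and counts below are ours, not taken from the article), the following Python sketch computes recall, precision, the error rate mentioned above, and the F-measure of Equation (3) directly from the four counts of a confusion matrix.

```python
def evaluate(tp, fp, fn, tn):
    """Recall, precision, error rate, and F-measure (Eq. (3)) from confusion matrix counts."""
    n = tp + fp + fn + tn
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    error_rate = (fp + fn) / n
    # Harmonic mean of precision and recall, equivalently 2*TP / (2*TP + FP + FN).
    f_measure = 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0
    return recall, precision, error_rate, f_measure

# Arbitrary example counts for a test set of 1,000 objects:
print(evaluate(tp=80, fp=20, fn=40, tn=860))
```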
In numerical taxonomy, this measure is called the Dice coefficient or the Sørensen-Dice coefficient [2, 56, 58] and it also goes under other names. In particular, the F-measure as defined above is commonly known as the F\(_1\)-measure (or balanced F-measure), being a particular case of a more general weighted version [54], defined as
\[ F_{\alpha } = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}. \qquad (4) \]
This can be rewritten as Equation (2), with \(\beta ^2 = (1 - \alpha) / \alpha\), where with \(\alpha =1/2\) (\(\beta =1\)), we get the F\(_1\)-measure, short for F\(_{\beta =1}\) [44]. As we discussed in Section 2.1, the F\(_{\beta }\)-measure was derived from the effectiveness measure, E (Equation (1)), as \(F_{\beta } = 1 - E\), developed by Van Rijsbergen [59] who writes that it
measures the effectiveness of retrieval with respect to a user who attaches \(\beta\) times as much importance to recall as precision (page 123).
However, as we show in Section 4.3, the F-measure can be reformulated as a weighted arithmetic mean. In this reformulation, the weights assigned to precision and recall in the F\(_{\beta }\)-measure depend not only upon \(\beta\) but also upon the actual classification outcomes.
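The following sketch (our own helper functions, not from the article) computes the generalised F\(_{\beta }\) from precision and recall and checks numerically that the \(\alpha\)-weighted harmonic mean of Equation (4), with \(\beta^2 = (1-\alpha)/\alpha\), gives the same value, and that \(\beta = 1\) recovers the F\(_1\)-measure.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (Eq. (2))."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def f_alpha(precision, recall, alpha=0.5):
    """Alpha-weighted harmonic mean of precision and recall (Eq. (4))."""
    return 1.0 / (alpha / precision + (1 - alpha) / recall)

P, R = 0.8, 0.6
for beta in (0.5, 1.0, 2.0):
    alpha = 1.0 / (1.0 + beta ** 2)   # from beta^2 = (1 - alpha) / alpha
    print(beta, round(f_beta(P, R, beta), 4), round(f_alpha(P, R, alpha), 4))
# For beta = 1 both reduce to the harmonic mean 2*P*R / (P + R) = 0.6857;
# beta > 1 moves the value towards recall, beta < 1 towards precision.
```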
Precision and recall reflect different aspects of a classification method’s performance, so combining them is natural. Moreover, both are proportions, and both have a representational meaning, a topic we return to in Section 4.5. Precision can be seen as an empirical estimate of the conditional probability of a correct classification given predicted class 1 (\(Prob(True=1|Pred=1)\)), and recall as an empirical estimate of the conditional probability of a correct classification given true class 1 (\(Prob(Pred=1|True=1)\)). An average of these, however, has no interpretation as a probability, and unlike many other performance measures also has no representational meaning [24].
The mean of precision and recall does not correspond to any objective feature of classification performance; it is not, for example, an empirical estimate of any probability associated with a classifier method. Formally, and as we discuss further in Section 4.5, the F-measure, as a harmonic mean, is a pragmatic measurement [22, 24, 34, 47]: it is a useful numerical summary but does not represent any objective feature of the classifier method being evaluated. This is in contrast with representational measures which correspond to real objective features: precision and recall separately are examples, since they correspond to empirical estimates of probabilities of certain kinds of misclassification.
There is also no straightforward justification for using the harmonic mean to combine precision and recall. A formal argument is sometimes made that for averaging rates the harmonic mean is more natural than, say, the arithmetic mean, but this is misleading. One might argue that the harmonic mean of precision and recall is equivalent to (the reciprocal of) the arithmetic mean of the number of true class 1 cases per class 1 case correctly classified, and the number of predicted class 1 cases per class 1 case correctly classified. But this simply drives home the fact that precision and recall are non-commensurate.
A different argument in favour of the F-measure has been made by Van Rijsbergen [59] using conjoint measurement. The essence of his argument is first to show that there exist non-intersecting isoeffectiveness curves in the \((P,R)\)-space (sometimes called indifference curves: curves showing combinations of P and R which are regarded as equally effective), then to determine the shape of these curves, and hence to decide how to combine P and R to identify which curve any particular \((P,R)\) pair lies on. In particular, he arrives at the conclusion that the harmonic mean (weighted if necessary) determines the shapes of the curves. To explore reasonable shapes for these curves, and noting that P and R are proportions, Van Rijsbergen [59] (pages 122 and 123) makes the assumption of decreasing marginal effectiveness: the user of the system is willing to sacrifice one unit of precision for an increase of one unit of recall, but will not sacrifice another unit of precision for a further unit increase in recall. For P and R values near zero, this leads to isoeffectiveness curves which are convex towards the origin. Curves based on the harmonic mean of P and R have this shape.
As we noted above, one way to look at the harmonic mean is that it is the arithmetic mean on the reciprocal of the original scale. That is, it is the reciprocal of the arithmetic mean of \(1/P\) and \(1/R\), as can be seen in Equation (3). But the reciprocal transformation is not the only transformation of the scale which will produce isoeffectiveness curves of this shape. For example, transforming to \(log(P)\) and \(log(R)\) will also yield convex isoeffectiveness curves (and results in the geometric mean of P and R, which is known as the Fowlkes-Mallows index [20] of classifier performance). In short, the choice of reciprocal transformation, and hence the harmonic mean, seems arbitrary.
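To make this concrete, the short sketch below (our own code and example values) compares the harmonic mean (the F-measure) with the geometric mean (the Fowlkes-Mallows index) for two precision-recall pairs; both are legitimate single-number summaries with convex isoeffectiveness curves, yet here they order the two pairs differently.

```python
from math import sqrt

def harmonic_mean(p, r):
    """F-measure: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

def geometric_mean(p, r):
    """Fowlkes-Mallows index: geometric mean of precision and recall."""
    return sqrt(p * r)

for p, r in [(0.90, 0.30), (0.55, 0.45)]:
    print(f"P={p:.2f} R={r:.2f}  harmonic={harmonic_mean(p, r):.3f}  geometric={geometric_mean(p, r):.3f}")
# The harmonic mean ranks the second pair higher (0.495 vs 0.450), while the geometric
# mean ranks the first pair higher (0.520 vs 0.497), so the choice of transformation
# (reciprocal versus logarithmic) is not neutral.
```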
As typically used in numerical taxonomy [56], the F-measure has more justification. Here, it is used as a measure of similarity between two objects, counting the agreement between multiple binary characteristics of the objects. Thus, referring to Figure 2 above, TP is the number of characteristics that are present for both objects, FN is the number of characteristics that are present for object A but absent for B, and so on. Since the number of potential descriptive characteristics of objects can be made arbitrarily large, the number of characteristics possessed by neither object, that is, the count TN, should not be included in the measure. But this interpretation seems to be irrelevant in the situation when classification methods are being evaluated, as we discuss below.
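A minimal sketch of this numerical taxonomy usage (our own code and example vectors): the Dice coefficient between two binary characteristic vectors counts shared present characteristics and ignores characteristics absent from both objects, so padding both vectors with additional all-absent characteristics leaves the value unchanged.

```python
def dice_coefficient(a, b):
    """Sørensen-Dice similarity between two binary characteristic vectors.
    Characteristics absent from both objects (the TN count) play no role."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)    # present in both (TP)
    only_a = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)  # present only in A
    only_b = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)  # present only in B
    return 2 * both / (2 * both + only_a + only_b)

obj_a = [1, 1, 0, 1, 0, 0]
obj_b = [1, 0, 0, 1, 1, 0]
print(dice_coefficient(obj_a, obj_b))                          # 0.667
print(dice_coefficient(obj_a + [0] * 100, obj_b + [0] * 100))  # still 0.667: absent-absent pairs are ignored
```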
4 Criticism of the F-Measure
Over the years, researchers have questioned various aspects of the F-measure and its suitability as an evaluation measure in the context of classification problems [30, 51, 60]. In this section, we summarise and discuss these issues.
4.1 The F-Measure Ignores True Negatives
As can be seen from Equation (3), the F-measure does not take the number of true negatives into account. In its original context in information retrieval, true negatives are the documents that are irrelevant to a given query and are correctly classified as irrelevant. Their number can be arbitrarily large (the actual number may even be unknown). When comparing the effectiveness of retrieval systems, adding more correctly classified irrelevant documents to a collection should not influence the value of the evaluation measure used. Precision, recall, and the F-measure all have this property [51].
In the context of classification, however, the number of objects in the negative class is rarely irrelevant. Consider a classification method trained on a database of personal health records where some patients are known to have cancer (the class of interest and hopefully also the minority class). While the classification of positives (possible cancer cases for which patients should be offered a test or treatment) is clearly the focus, how many non-cancer patients are correctly classified as not having the disease is also of high importance for these individuals [42]. Therefore, the F-measure would not really be a suitable evaluation measure for such a classification problem.
We illustrate this issue in Figure 3(a) and (b), where the two matrices shown have different counts but yield the same F-measure. Matrix (a) shows nearly 86% (600 out of 700) of the negative objects (class 0) correctly classified, while in matrix (b) over 98% of them (688 out of 700) are correctly classified. Note, however, that the classifier behind matrix (a) correctly classifies more positive objects (class 1) than the classifier behind matrix (b).
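The counts below are a reconstruction consistent with the description of Figure 3 (600 and 688 of 700 negatives correct, and the precision, recall, and F-measure values reported later in this section); they are our inference rather than a copy of the figure, but they show the effect directly.

```python
def f_measure(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)

# Matrix (a): 600 of 700 negatives correct, more positives correctly classified.
tp_a, fp_a, fn_a, tn_a = 250, 100, 50, 600
# Matrix (b): 688 of 700 negatives correct, fewer positives correctly classified.
tp_b, fp_b, fn_b, tn_b = 195, 12, 105, 688

print(round(f_measure(tp_a, fp_a, fn_a), 3))  # 0.769
print(round(f_measure(tp_b, fp_b, fn_b), 3))  # 0.769 -- identical, because TN never enters the formula
```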
It is, therefore, important to understand that the F-measure is only a suitable measure for classification problems when the negative (generally the majority) class is not of interest at all for a given problem or application [51].
4.2 The Same F-Measure can be Obtained for Different Pairs of Precision and Recall
A common aspect of all performance measures that combine the numbers in a confusion matrix into a single value is that different counts of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN) can result in the same value for a certain measure. This is the rationale behind isoeffectiveness curves [59]. For example, for a given \(n=TP+FP+FN+TN\), any values of TP and TN that sum to the same total will lead to the same accuracy.
Specifically for the F-measure, from the right-hand side of Equation (3) we can see that, for classifiers applied to the same dataset (so that \(TP+FN=n_1\) is fixed), any pair \((TP, FP)\) satisfying \(FP=k \cdot TP - n_1\) for a given constant k will yield the same F-measure, even though precision and recall may differ between the classifiers. (Substituting into Equation (3) shows that \(k = 2/F - 1\).) All three matrices in Figure 3 have \(k=1.6\). This means that classification methods that achieve very different results when evaluated using precision and recall can provide the same F-measure result.
An example can be seen in Figure 3(b) and (c), where confusion matrix (b) results in \(P=0.942\) and \(R=0.650\), matrix (c) results in \(P=0.654\) and \(R=0.933\), while for both these matrices \(F=0.769\).
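The sketch below checks this relationship numerically, using counts for matrices (b) and (c) reconstructed from the precision and recall values just given (the exact counts are our inference, not taken from the figure).

```python
def precision_recall_f(tp, fp, fn):
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

n1 = 300                                                # number of true class 1 objects (reconstructed)
matrices = {"b": (195, 12, 105), "c": (280, 148, 20)}   # (TP, FP, FN)

for name, (tp, fp, fn) in matrices.items():
    p, r, f = precision_recall_f(tp, fp, fn)
    k = 2 / f - 1                                       # k = 1.6 when F = 0.769
    print(name, round(p, 3), round(r, 3), round(f, 3),
          fp == round(k * tp - n1))                     # FP = k*TP - n1 holds for both matrices
```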
The F-measure should, therefore, never be reported in isolation without also reporting precision and recall results. In situations where only a single measure is evaluated, such as for hyperparameter tuning in automated machine learning [38], relying on the F-measure can be dangerous because methods with very different precision and recall behaviour can appear identical (although this caveat applies whenever performance is summarised into a single number).
4.3 The Weights Assigned to Precision and Recall Depend Not Only on Alpha (or Beta)
As we discussed in Section 3, a generalised version of the F-measure allows assigning weights to precision and recall using the parameter \(\alpha\), see Equation (4), or equivalently \(\beta\), see Equation (2).
In an effort to understand the use of the F-measure in the context of record linkage (also known as entity resolution) [5, 13], the process of identifying records that refer to the same entities within or across databases, Hand and Christen [29] showed that the harmonic mean representation of the F-measure can be reformulated as a weighted arithmetic mean of precision and recall as \(F = pR + (1-p)P\), where \(p = (TP+FN)/(2TP+FP+FN) = P/(R+P)\). In this weighted arithmetic mean reformulation, however, the value of the weight p given to recall depends upon the outcome of the evaluated classification method. When several classification methods are compared, the weight p assigned to recall can be different if the numbers of false positives and false negatives obtained by these methods differ. From the example confusion matrices in Figure 3, we can calculate \(p = 0.462\) for matrix (a), \(p = 0.592\) for matrix (b), and \(p = 0.412\) for matrix (c).
As a result, in this weighted arithmetic mean reformulation of the F-measure, the weights assigned to precision and recall depend not only upon the values of \(\alpha\) or \(\beta\), but also upon the actual classification outcomes. We describe this property of the F-measure, including an extension of the work by Hand and Christen [29] for the generalised F\(_{\beta }\)-measure from Equation (2), in more detail in Appendix A.
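A short numerical check of this reformulation, again using our reconstructed counts for the three matrices of Figure 3 (our inference from the reported precision, recall, and weight values):

```python
def weights_check(tp, fp, fn):
    p = tp / (tp + fp)                  # precision
    r = tp / (tp + fn)                  # recall
    f_harmonic = 2 * p * r / (p + r)
    weight = p / (p + r)                # weight given to recall; equals (TP+FN)/(2TP+FP+FN)
    f_arithmetic = weight * r + (1 - weight) * p
    return weight, f_harmonic, f_arithmetic

for name, counts in {"a": (250, 100, 50), "b": (195, 12, 105), "c": (280, 148, 20)}.items():
    w, fh, fa = weights_check(*counts)
    print(name, round(w, 3), round(fh, 3), round(fa, 3))
# The harmonic and weighted arithmetic mean formulations agree for each matrix,
# but the weight given to recall (about 0.462, 0.592, and 0.412) differs between
# matrices, because it depends on the classification outcome itself.
```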
4.4 The F-Measure has an Asymmetric Behaviour for Varying Classification Thresholds
In Section 3, we discussed how a specific confusion matrix for a given classification method can be obtained by setting a “classification threshold” t to a certain value. For a given classification problem (a set of objects to be classified), modifying this threshold will likely change the individual counts of TP, FP, FN, and TN, while their total number \(n = TP+FP+FN+TN\), and the numbers of actual positive (class 1) objects, \(n_1 = TP+FN\), and negative (class 0) objects, \(n_0=TN+FP\) (with \(n=n_0+n_1\)), are fixed. Generally, lowering the threshold t means more objects are classified as positives, with the numbers of TP and FP increasing and the numbers of TN and FN decreasing. Conversely, increasing t generally results in more objects being classified as negatives, with the numbers of TP and FP decreasing and the numbers of TN and FN increasing.
Therefore, as we lower the classification threshold t, recall (R) either stays the same (if no additional class 1 objects are classified into class 1 at the lower t) or increases (if more class 1 objects are classified into class 1 at the lower t). Recall can never decrease as t is lowered.
Precision (P), on the other hand, can increase, stay the same, or decrease, both when the classification threshold t is increased and when it is decreased. A change in the value of precision depends upon the distributions of the scores of objects in the two classes, as well as the class imbalance. For example, for one decrease of the threshold t more class 1 objects might be newly classified as being in class 1 compared to class 0 objects, while for another decrease of t more class 0 objects might be newly classified as being in class 1 compared to class 1 objects. With large class imbalances, where \(n_1 \lt n_0\), precision generally decreases as t gets lower because more class 0 objects are classified to be in class 1 (as false positives) compared to class 1 objects (as true positives). We show how precision changes for real datasets in Appendix B.
If we assume the scale of scores, \(s(\mathbf {x})\), assigned to objects is standardised into the range 0 to 1, we can accordingly set the threshold \(0 \le t \le 1\). Assuming further that in the extreme case \(t=0\) all objects are classified as positives (class 1) and in the case \(t=1\) all are classified as negatives (class 0), we have the following:
— If \(t=0\) then \(TP=n_1\), \(FP=n_0\), \(TN=0\), and \(FN=0\), and therefore \(P=n_1/(n_1+n_0) = n_1/n\) and \(R=n_1/n_1=1\).
— If \(t=1\) then \(TP=0\), \(FP=0\), \(TN=n_0\), and \(FN=n_1\), and therefore \(P=0\) (for convenience we define that \(P = 0/0 = 0\)) and \(R=0/n_1=0\).
With \(t=0\), precision therefore becomes \(P=n_1 / n = 1 / (ci+1)\), a ratio which depends upon the class imbalance of the given classification problem, \(ci = n_0/n_1\). Here, we assume that \(n_0 \ge n_1\) and therefore \(ci \ge 1\) (the negative class, 0, is the majority class and the positive class, 1, the minority class).
For a balanced classification problem with \(ci=1\), for \(t=0\) we obtain \(P=1/2\), \(R=1\), and therefore \(F=2/3\). For an imbalanced problem where, for example, 20% of all objects are positive and 80% are negative (\(ci=4\)), for \(t=0\) we obtain \(P=1/5\), \(R=1\), and therefore \(F=1/3\). For \(t=1\), for both problems, we obtain \(F=0\) because \(TP=0\).
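This asymmetric behaviour can be seen in a small simulation (entirely synthetic scores and our own code): as the threshold t is lowered, recall never decreases, while precision can move in either direction, and at \(t=0\) we recover the imbalance-dependent values \(P=1/(ci+1)\) and \(R=1\) described above.

```python
import random

random.seed(0)

# Synthetic scores: class 1 (minority) scores tend to be higher than class 0 scores.
n1, n0 = 100, 400                                    # class imbalance ci = n0 / n1 = 4
scores = [(random.betavariate(4, 2), 1) for _ in range(n1)] + \
         [(random.betavariate(2, 4), 0) for _ in range(n0)]

def prf_at_threshold(scored_objects, t):
    tp = sum(1 for s, y in scored_objects if s > t and y == 1)
    fp = sum(1 for s, y in scored_objects if s > t and y == 0)
    fn = sum(1 for s, y in scored_objects if s <= t and y == 1)
    p = tp / (tp + fp) if tp + fp > 0 else 0.0
    r = tp / (tp + fn) if tp + fn > 0 else 0.0
    f = 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0
    return p, r, f

for t in (0.9, 0.7, 0.5, 0.3, 0.1, 0.0):
    p, r, f = prf_at_threshold(scores, t)
    print(f"t={t:.1f}  P={p:.3f}  R={r:.3f}  F={f:.3f}")
# Recall never decreases as t is lowered, while precision typically falls under this
# 4:1 class imbalance; at t=0 every object is assigned to class 1, so P=1/(ci+1)=0.2,
# R=1, and F=1/3, matching the imbalanced example above.
```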
4.5 The F-Measure is not a Representational Measure
Performance measures can be categorised into representational and pragmatic measures [22, 24, 34, 47]. Measures in the former category quantify some property of the attributes of real objects, while measures in the latter category assign a numerical value to objects where these values may not represent any attributes of these objects. Examples of representational measures are the height and weight of people, while a pragmatic measure would be their university GPA (grade point average) scores. GPA is a construct, without objective empirical existence.
Unlike precision and recall, which are both representational measures, the harmonic mean formulation of the F-measure is a pragmatic measure. It is a useful numerical summary but it does not represent any objective feature of the classification method being evaluated. In the quest to develop an intuitively interpretable transformation of the F-measure, Hand et al. [30] recently proposed the \(F^*\) (F-star) measure, which we describe in the following section.
A criticism raised by Powers [51] is that averaging F-measure results is nonsensical because it is not a representational measure. Averaging the results of classification methods is commonly conducted in experimental studies. For example, to identify the best performing method from a set of classification methods, for each method, its results obtained on different datasets can be summarised using the arithmetic mean and standard deviation. Because recall and precision are proportions, averaging over multiple precision or recall results, respectively, will yield a value that is also a proportion and therefore has a meaning (the average recall or average precision). However, given the F-measure is a pragmatic measure, averaging several F-measure results is akin to comparing apples with pears [51].
6 Discussion and Recommendations
Duin [16] and Hand [23, 25] have pointed out some of the realities of evaluating the performance of classification methods. In addition to the problem-based aspects mentioned in Section 1, they include the fact that empirical evaluation has to be on some datasets, and these datasets may not be similar to the data that the classification method is being applied to. Moreover, there are many different kinds of accuracy measures: ideally a measure should be chosen which reflects the objectives. So, for example, we might wish to minimise the overall proportion of objects which are misclassified (perhaps leading to misclassifying all cases from the smaller class if the class sizes are very unequal), to minimise a cost-weighted overall loss, to minimise the proportion of class 1 objects misclassified subject to misclassifying no more than a certain percentage of the other class (for example, 5%), to minimise the proportion of objects classified as class 1 which, in fact, come from class 0 (the false discovery rate), and so on, effectively endlessly.
While the fact that there are only four counts in a confusion matrix (and, indeed, summing to a fixed total) means that these measures are related, it does not determine which measure is suitable for a particular problem. But the choice can be critical. Benton [4, Figure 4.1] gave a real-data example in which optimising two very widely used measures of performance led to linear combinations of the variables which were almost orthogonal: the “best” classification methods obtained using these two measures could hardly have been more different. Things are complicated yet further by the need to choose a threshold to yield a confusion matrix or misclassification table. This has led to measures such as the AUC-PR [14], the area under the ROC curve [18, 40] (with its known conceptual weakness [28]), and the H-measure [26] (which we discuss in Appendix C). All these measures average over a distribution of possible thresholds. And yet further issues are described by Hand [25].
The implication of all of this is that just as much thought should be given to how classifier performance is to be measured as to which classification method(s) to employ for a given classification problem. Software tools which provide a wide variety of classification methods and make them easy to use are now readily available, but far less emphasis has been placed on the choice of measure of classification performance. And yet a classification method which appears to be good under one measure may be poor under another. The critical issue is to match the measure to the objectives of a given classification problem.
The F-measure, as widely used, is based on an ad hoc notion of combining two aspects of classifier performance, precision and recall, using the harmonic mean. This results in a pragmatic measure that has a poor theoretical base: it seems not to correspond to any fundamental aspect of classifier performance. With the aim of helping researchers improve the evaluation of classification methods, we conclude our work with a set of recommendations on how to use the F-measure in an appropriate way:
— The first aspect to consider is whether the F-measure is really a suitable performance measure for a given classification problem. Specifically, is there clear evidence that incorrect classifications of one of the classes are irrelevant to the problem? Only if this question can be answered affirmatively should the F-measure be considered. Otherwise, a performance measure that considers both classes should be employed.
— As we discussed in Section 4.2, different pairs of precision and recall values can yield the same F-measure result. It is, therefore, important to not only report the F-measure but also precision and recall when evaluating classification methods [42]. Only when assessing and comparing the values of all three measures can a valid picture of the comparative performance of classification methods be made.
— If a researcher prefers an interpretable measure, then the \(F^*\)-measure [30] discussed in Section 5.1, a monotonic transformation of the F-measure, can be used.
— If a researcher knows what importance they want to assign to precision and recall, the general weighted \(F_{\beta }\) or \(F_{\alpha }\) versions from Equations (2) or (4), respectively, can be used. For the weighted arithmetic mean reformulation of the F-measure we discussed in Section 4.3 (and also Appendix A), we recommend specifying the weight p assigned to recall in Equations (7) and (11), and correspondingly setting the individual classification thresholds, t, for each classification method being compared such that the same required number of objects, \(TP+FP\), is classified into class 1. Alternatively, as we discuss in Appendix A.2, a researcher can set the weight for recall as w and for precision as \((1-w)\), and instead of the F-measure explicitly calculate the weighted arithmetic mean of precision and recall, \(wR + (1-w)P\), for all classification methods being compared (a small sketch of this alternative is given after this list).
— If it is not possible to specify a particular classification threshold, then we recommend using a measure such as the H-measure [26, 28], which averages the performance of classification methods over a distribution of threshold values, t, as we discuss in Appendix C. Alternatively, precision-recall curves should be provided which illustrate the performance of a classification method over a range of values for the threshold t.
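As a final sketch of these recommendations (our own code; it assumes the \(F^*\)-measure of [30] is the monotonic transformation \(F^* = F/(2-F)\), equivalently \(TP/(TP+FP+FN)\), and the weight w given to recall is a choice the researcher must fix in advance), the snippet reports precision and recall alongside any single-number summary, rather than the F-measure alone.

```python
def report(name, tp, fp, fn, w=0.5):
    """Report precision and recall together with single-number summaries:
    the F-measure, an F* transformation (assumed here to be F/(2-F), i.e. TP/(TP+FP+FN)),
    and an explicit weighted arithmetic mean wR + (1-w)P with a weight w fixed in advance."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = 2 * p * r / (p + r)
    f_star = f / (2 - f)                 # equivalently TP / (TP + FP + FN)
    wam = w * r + (1 - w) * p            # fixed-weight arithmetic mean (the Appendix A.2 alternative)
    print(f"{name}: P={p:.3f} R={r:.3f} F={f:.3f} F*={f_star:.3f} wAM={wam:.3f}")

# Two hypothetical classification methods evaluated on the same test set:
report("method 1", tp=250, fp=100, fn=50)
report("method 2", tp=195, fp=12, fn=105)
```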
The critical issue for any classification problem is to decide what aspects of classification performance matter, and to then select an evaluation measure which reflects those aspects.