Reference Vs Consensus Values
https://doi.org/10.1007/s00769-019-01423-6
GENERAL PAPER
Abstract
Proficiency testing (PT), or external quality control, provides an additional means of ensuring the quality of laboratory testing results. Various methods can be used in practice to fulfill this objective. The most commonly applied are comparisons of laboratory results with reference or consensus values. In this work, we study the concordance between these two schemes based on a review of a large dataset of clinical chemistry proficiency testing results. The analysis is carried out using several statistical methodologies (diagnostic tests, contingency tables, and hypothesis tests). The results indicate that the conclusions obtained from these schemes can, for several analytes, be markedly different. This is possibly because some statistical assumptions required to apply PT based on consensus values are violated.
Accreditation and Quality Assurance
under investigation. Koch and Baumeister [4] mention that the best way to avoid potentially biased assigned values is to use RV, thus ensuring that the assigned value is close to the true value. In the same sense, Baldan et al. [8], based on a particular dataset, carried out an economic assessment of RV and CV and concluded that the use of CV does not necessarily reduce the costs of a PT and that the quality assessment of laboratories is frequently better when RV is used.

In this work, we compare clinical chemistry PT results based on CV and RV. We use a large dataset collected by PROASECAL [5] to carry out the analysis. The main objective is to quantify and characterize, through several statistical methodologies, the possible discrepancies between these schemes for the particular case of laboratories in Colombia.

The article is organized as follows. The “Data and methodology” section describes the dataset studied and the methodology used for the analysis. The “Results and discussion” section gives a discussion based on the results. The paper ends with a “Conclusions” section.

Data and methodology

Performance assessment of laboratories in PT is generally based on four statistics (all of these leading to the same conclusion). These are the difference (D), percentage difference (Drel), percentage of allowed deviation (PA) and Z score [9]. Given a set of reported values, the statistics can be calculated based on CV (consensus mean ($x^*$) and consensus standard deviation ($s^*$)) or RV ($x_{pt}$ and $\sigma_{pt}$). The expression for each of these statistics is shown in Table 1. Using these statistics, we compare PT results obtained under both criteria (CV and RV) from data collected by PROASECAL [5] over six years (2013–2018). Specifically, we analyze the reported values of a large number of laboratories for several analytes (Table 2). The number of participants varies from one analyte to another (see the first two columns in Table 2). Using the statistics in Table 1, the reported value of each laboratory for a particular analyte is classified, under each scheme (CV and RV), as satisfactory or unsatisfactory. For example, using the Z score criterion, the result of a particular laboratory ($x_i$) is unsatisfactory when $|Z_i|$ is greater than 3 [10]. The rules for establishing whether the value reported by a laboratory is satisfactory or unsatisfactory when the statistics D, Drel and PA are used can be reviewed in [7]. From the values reported by laboratories in various rounds, we can build a table with counts of concordances and discordances between the two schemes. The possible results of these counts are sketched in Fig. 1, where the values a, b, c and d correspond to the following counts of events:

1. Number of true positives (a): Number of laboratories with unsatisfactory performance according to both schemes.
2. Number of false positives (b): Number of laboratories with satisfactory performance according to RV and unsatisfactory performance under CV.
3. Number of false negatives (c): Number of laboratories with unsatisfactory performance according to RV and satisfactory performance under CV.
4. Number of true negatives (d): Number of laboratories with satisfactory performance according to both schemes.

Table 2 shows these counts for each of the analytes considered. These datasets can be arranged in 2×2 contingency tables (as described in Fig. 1) of matched-pair studies [11], where the performance of each laboratory is a dichotomous variable whose results (satisfactory or unsatisfactory) are evaluated by both CV and RV. For example, in the case of uric acid (Table 2), there were 13027 reports from laboratories (collected during the period of study), and the performance of each one was classified as satisfactory or unsatisfactory according to both CV and RV.
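The classification and counting steps described above can be sketched in Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names (`performance_stats`, `is_unsatisfactory`, `contingency_counts`) and all numeric inputs are hypothetical, the statistics follow the Table 1 definitions with the allowed deviation taken as three standard deviations, and only the Z score rule ($|Z_i| > 3$) is used to classify a result.

```python
def performance_stats(x_i, assigned, sd):
    """Return D, Drel (%), PA (%) and Z for one reported value.

    `assigned` and `sd` are either the consensus pair (x*, s*) or the
    reference pair (x_pt, sigma_pt); the allowed deviation is 3 * sd.
    """
    d = x_i - assigned
    return {
        "D": d,
        "Drel": d / assigned * 100.0,
        "PA": d / (3.0 * sd) * 100.0,
        "Z": d / sd,
    }


def is_unsatisfactory(x_i, assigned, sd, limit=3.0):
    """Z score rule: unsatisfactory when |Z| exceeds the limit."""
    return abs(performance_stats(x_i, assigned, sd)["Z"]) > limit


def contingency_counts(values, cv, rv):
    """Count (a, b, c, d) for reported values classified under CV and RV.

    `cv` and `rv` are (assigned value, standard deviation) pairs.
    """
    a = b = c = d = 0
    for x_i in values:
        bad_cv = is_unsatisfactory(x_i, *cv)
        bad_rv = is_unsatisfactory(x_i, *rv)
        if bad_cv and bad_rv:
            a += 1  # unsatisfactory under both schemes (true positive)
        elif bad_cv:
            b += 1  # unsatisfactory under CV only (false positive)
        elif bad_rv:
            c += 1  # unsatisfactory under RV only (false negative)
        else:
            d += 1  # satisfactory under both schemes (true negative)
    return a, b, c, d


# Hypothetical reported values and assigned values, for illustration only.
reports = [5.6, 5.9, 7.4, 4.4, 7.0, 5.7]
counts = contingency_counts(reports, cv=(5.8, 0.5), rv=(5.7, 0.4))
print(counts)  # prints (1, 0, 2, 3)
```

In this toy example the wider consensus standard deviation lets two results pass under CV that fail under RV, which is the kind of discordance counted in cell c.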
Of these 13027 uric acid reports, 918 (7 % of cases) and 319 (3 % of cases) were classified, respectively, as true positives and true negatives. In the remaining cases, there were differences between the tests based on CV and RV (88 % were false positives and 2 % false negatives).

Table 1 Statistics used in proficiency testing for assessing the participants’ performance: difference (D), percentage difference (Drel), percentage of allowed deviation (PA) and Z score (Z)

Statistic      Consensus value                        Reference value
$D_i$          $x_i - x^*$                            $x_i - x_{pt}$
$D_{i,rel}$    $\frac{x_i - x^*}{x^*} \times 100$     $\frac{x_i - x_{pt}}{x_{pt}} \times 100$
$PA_i$         $D_i / \hat{\delta}_E \times 100$      $D_i / \delta_E \times 100$
$Z_i$          $\frac{x_i - x^*}{s^*}$                $\frac{x_i - x_{pt}}{\sigma_{pt}}$

These are calculated based on $x_i$ (the value reported by the i-th participant), consensus values ($x^*$, $s^*$ and $\hat{\delta}_E = 3 \times s^*$), and reference values ($x_{pt}$, $\sigma_{pt}$ and $\delta_E = 3\sigma_{pt}$)

In statistics, there are several tests appropriate for the treatment of dichotomous variables such as those mentioned above. Among others, the $\chi^2$, Fisher and McNemar’s tests [11] can be used in this scenario. The latter is particularly useful when two tests are performed on the same group of patients [11]. The datasets shown in Table 2 have this structure. For each analyte, two tests (based on CV and RV) are performed on the same group of patients (laboratories). With this in mind, here we use McNemar’s test to assess the dependence between CV and RV results. The null hypothesis is that, in the population of laboratories who can be studied with both
schemes (CV and RV), the proportion of them that would obtain unsatisfactory performance from RV (call it $P_1$) is the same as the proportion receiving unsatisfactory performance from CV (call it $P_2$); that is, $H_0: P_1 = P_2$ versus $H_a: P_1 \neq P_2$ [12]. Based on the values defined in Fig. 1, these proportions can be estimated as

$$\hat{P}_1 = \frac{a + c}{n}, \qquad \hat{P}_2 = \frac{a + b}{n},$$

with $n$ the total number of laboratories.

Table 2 notes: In brackets are the corresponding proportions with respect to the total number of laboratories. The P-value of McNemar’s test is also shown for each case.

Fig. 1 (labels recovered from the figure: Consensus, Satisfactory)

Alternatively, the hypothesis could also be stated as $H_0: \psi = 1$ versus $H_a: \psi \neq 1$, where $\psi$ is the population ratio estimated by
b/c (Fig. 1). We conduct a McNemar’s test (for each analyte) using the values a, b, c and d given in Table 2 to test the null hypothesis described above. The P-values of these tests are shown in the last column of Table 2.

Sensitivity ($\pi$), specificity ($\gamma$) and predictive values (positive ($T^+$) and negative ($T^-$)) are commonly used for screening and diagnostic tests [13]. These values measure the agreement between the results of a test under evaluation and those of the reference standard [14]. This principle can be adapted to the context of PT to assess the agreement between the results based on consensus and those obtained from reference values. In this scenario, the test under evaluation is the consensus and the reference standard is the reference value. A proficiency test based on consensus will be better insofar as there is a high degree of agreement between its results and those obtained by using the reference values ($x_{pt}$ and $\sigma_{pt}$). Based on the a, b, c and d values defined in Fig. 1 and shown in Table 2, $\pi$, $\gamma$, $T^+$ and $T^-$ are estimated, respectively, as

$$\pi = \frac{a}{a + c}, \qquad \gamma = \frac{d}{b + d}, \qquad T^+ = \frac{a}{a + b}, \qquad T^- = \frac{d}{c + d}.$$

In this scenario, $\pi$ is the proportion of laboratories with unsatisfactory performance according to RV that also have unsatisfactory performance according to CV (proportion of true positives). On the other hand, $\gamma$ is the proportion of laboratories with satisfactory performance according to RV that also have satisfactory performance under CV (proportion of true negatives). $T^+$ is the proportion of laboratories with unsatisfactory performance according to CV that actually have unsatisfactory performance according to RV. $T^-$ is the proportion of laboratories with satisfactory performance according to CV that also have satisfactory performance under RV [15]. The values of these measures (for each analyte) are shown in Table 3. These are used in the "Results and discussion" section to help explain the differences detected between the CV and RV schemes.

Table 3 Proportions $\pi$, $\gamma$, $T^+$ and $T^-$ for each analyte

Analyte                 $\pi$   $\gamma$   $T^+$   $T^-$
Uric acid               0.74    0.98       0.78    0.97
Albumin                 0.61    0.98       0.88    0.94
Alat                    0.76    0.98       0.79    0.97
Asat                    0.79    0.97       0.73    0.98
Total bilirubin         0.74    0.97       0.76    0.97
Cholesterol             0.73    0.98       0.88    0.96
Creatinine              0.72    0.98       0.81    0.97
Glucose                 0.68    0.98       0.83    0.95
Ureic nitrogen          0.65    0.99       0.94    0.90
Total proteins          0.69    0.98       0.87    0.95
Triglycerides           0.89    0.95       0.48    0.99
Calcium                 0.74    0.98       0.84    0.96
CK                      0.80    0.95       0.38    0.99
Chlorine                0.67    0.99       0.88    0.95
Alkaline phosphatase    0.78    0.97       0.65    0.98
Phosphorous             0.74    0.98       0.76    0.97
Iron                    0.77    0.96       0.74    0.96
LDH                     0.79    0.97       0.69    0.98
Magnesium               0.72    0.97       0.69    0.97
Sodium                  0.66    1.00       0.98    0.91
Potassium               0.84    0.96       0.45    0.99
HDL cholesterol         0.80    0.96       0.64    0.98
Amylase                 0.76    0.95       0.46    0.99
GGT                     0.70    0.98       0.73    0.98
Direct bilirubin        0.69    0.97       0.75    0.95
Ionized Calcium         0.72    0.97       0.58    0.98

$$
\begin{aligned}
T^+ &= \pi \\
\frac{a}{a + b} &= \frac{a}{a + c} \\
a(a + c) &= a(a + b) \\
\frac{a + c}{n} &= \frac{a + b}{n} \\
\hat{P}_1 &= \hat{P}_2,
\end{aligned}
\tag{1}
$$

In Eq. 1, $T^+$ and $\pi$ are estimates of the true unknown proportions.
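The agreement measures and McNemar’s test described above can be sketched with the Python standard library. This is a hedged illustration, not the authors' code: the function names `agreement_measures` and `mcnemar_test` are hypothetical, the counts are invented rather than taken from Table 2, and the test is written in its uncorrected chi-square form, $(b - c)^2 / (b + c)$ with one degree of freedom, since the text does not say which variant of McNemar’s test was applied.

```python
import math


def agreement_measures(a, b, c, d):
    """Sensitivity, specificity and predictive values, with RV as reference."""
    return {
        "pi": a / (a + c),       # sensitivity
        "gamma": d / (b + d),    # specificity
        "T_plus": a / (a + b),   # positive predictive value
        "T_minus": d / (c + d),  # negative predictive value
    }


def mcnemar_test(b, c):
    """McNemar's chi-square statistic (no continuity correction) and P-value.

    Only the discordant counts b and c enter the test. The P-value uses the
    chi-square distribution with 1 degree of freedom, whose survival
    function is erfc(sqrt(x / 2)).
    """
    statistic = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical counts, for illustration only (not taken from Table 2).
a, b, c, d = 90, 25, 30, 855
n = a + b + c + d

measures = agreement_measures(a, b, c, d)
p1_hat = (a + c) / n  # estimated proportion unsatisfactory under RV
p2_hat = (a + b) / n  # estimated proportion unsatisfactory under CV
psi_hat = b / c       # estimate of the ratio psi
statistic, p_value = mcnemar_test(b, c)

print({name: round(value, 2) for name, value in measures.items()})
print(round(p1_hat, 3), round(p2_hat, 3), round(psi_hat, 3))
print(round(statistic, 3), round(p_value, 3))
```

With these invented counts the discordant cells are nearly balanced, so the P-value is large and $H_0: P_1 = P_2$ would not be rejected at $\alpha = 5\,\%$.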
Note that the null hypothesis $H_0: P_1 = P_2$ for McNemar’s test can now be stated as $H_0: T^+ = \pi$ versus $H_a: T^+ \neq \pi$. Using the observations a, b, c and d and the definitions given above, this equivalence is shown in Eq. 1.

Another way of comparing the CV and RV schemes is by means of classical statistical tests. The probability distribution of the data can have an impact on performance assessment [2]. According to [16], the distribution of the results reported by the participating laboratories is expected to be normal or at least unimodal and reasonably symmetric. If the associated probability distribution is not normal, the assessment by consensus value may be impaired [17]. It is also reasonable to think that, in addition to normality, there should be a similarity between the mean and standard deviation of consensus ($x^*$ and $s^*$,
respectively) and those given by the reference values ($x_{pt}$ and $\sigma_{pt}$, respectively). The mean and standard deviation of consensus need to be reliable. When these are not correctly estimated, the PT could be considered inconsistent [2]. All these aspects can be studied statistically through hypothesis testing procedures. Specifically, a Lilliefors normality test [18], a one-sample t test (to compare $x^*$ with $x_{pt}$) and a chi-square test (to compare $s^*$ with $\sigma_{pt}$) can be used. A review of these tests can be found in [12]. The results of these tests (with the data corresponding to each analyte) are shown in the “Results and discussion” section. In all cases, the tests are carried out after algorithm A [19] has been applied to the data.

Results and discussion

The P-values of McNemar’s tests (Table 2), the estimates of $\pi$, $\gamma$, $T^+$ and $T^-$ (Table 3) and the P-values corresponding to normality tests, t tests and chi-square tests (Tables 4 and 5) are used in this section to describe the relationship between CV and RV. We consider in all cases a significance level $\alpha = 5\,\%$.

Several aspects of Table 2 are remarkable. The percentage of laboratories with satisfactory results according to RV, $\frac{b + d}{n}$, is in almost all cases equal to or greater than 85 % (ureic nitrogen and sodium are exceptions to this pattern). This result has two positive interpretations. On the one hand, it is an empirical indicator that laboratories' performances are in general reliable (i.e., Z score lower than 3 in a high proportion of cases), and on the other hand, it shows (taking into account that the percentage of false positives is low) that PT based on CV performs well in identifying true negatives (d values in Table 2). This point can also be established with the $\gamma$ values (≥ 95 % in all cases, Table 3).

The other side of the coin, however, is that false negatives (c values, Table 2) are relatively high compared with true positives (a values, Table 2). This can also be seen in the $\pi$ and $T^+$ values. We can observe in Table 3 that these proportions are much lower than those of $\gamma$ and $T^-$. In this sense, some critical results are given for analytes such as amylase, potassium, CK and triglycerides, where the $T^+$ values are even lower than 50 %. These results indicate that, for several analytes, PT based on CV has deficiencies in estimating true positives. In Table 2, we can see that 21 P-values from McNemar’s test are lower than $\alpha$; that is, in 81 % of cases (analytes) the hypothesis that the proportion of participants with unsatisfactory performance according to RV ($P_1$) is equal to the proportion with unsatisfactory performance under CV ($P_2$) is rejected (equivalently, the hypothesis that $\pi$ is equal to $T^+$ is rejected). This is another clear indicator that the scheme based on CV might be, for several analytes, inappropriate for identifying true positives.

In Tables 4 and 5, P-values corresponding to three tests (Lilliefors, one-sample t and chi-square) are given for each analyte. Each test is carried out with both normal (Table 4) and pathological values (Table 5) reported by participants in one round of a PT carried out in 2018. The percentages of P-values (Lilliefors test, t test and $\chi^2$ test, respectively) lower than $\alpha = 0.05$ are 42.3 %, 57.7 % and 69.2 % (normal values) and 38.5 %, 46.2 % and 65.4 % (pathological values). In 76.9 % (normal values) and 92.3 % (pathological values) of the cases, at least one of the hypotheses is rejected. From these results, we can conclude that in both cases (normal and pathological data), in a high percentage of the cases, at least one of the assumptions required to apply PT based on CV is not valid. This may be one of the reasons why CV results are not satisfactory (for some analytes). Total bilirubin, CK, iron, LDH and ionized calcium (normal data) and total proteins, sodium and ionized calcium (pathological data) are exceptions to the general pattern mentioned above (possibly due to the use in these cases of equipment with cutting-edge technology). In many cases, CV assumptions are violated because, among other reasons, the participants use different batches of calibrators, technologies, reagents and standardization protocols. There are also variations in preventive and corrective maintenance of equipment, and the procedures are highly influenced by the expertise of operational staff.

Table 4 P-values obtained from three hypothesis tests (Lilliefors, one-sample t test and $\chi^2$ test, respectively) based on data reported by participants in one round of a proficiency test carried out in 2018

Analyte                 ($x_{pt}$, $x^*$)    ($\sigma_{pt}$, $s^*$)    Lilliefors    t test    $\chi^2$ test
Uric acid               (5.7, 5.8)           (0.4, 0.5)                0.000         0.004     0.000
Albumin                 (4.3, 4.3)           (0.3, 0.3)                0.028         0.000     0.014
Alat                    (36.0, 34.9)         (4.0, 3.3)                0.000         0.000     0.000
Asat                    (36.0, 37.5)         (3.5, 4.5)                0.024         0.000     0.019
Total bilirubin         (1.6, 1.6)           (0.1, 0.1)                0.540         0.945     0.246
Cholesterol             (163.0, 159.6)       (10.5, 10.7)              0.011         0.000     0.022
Creatinine              (1.4, 1.4)           (0.1, 0.1)                0.047         0.653     0.012
Glucose                 (110.0, 110.2)       (8.5, 6.4)                0.057         0.000     0.000
Ureic nitrogen          (20.0, 20.3)         (2.2, 2.2)                0.000         0.118     0.041
Total proteins          (5.8, 5.9)           (0.6, 0.2)                0.164         0.000     0.000
Triglycerides           (101, 98.5)          (8.2, 10.6)               0.000         0.000     0.001
Calcium                 (9.6, 9.7)           (0.5, 0.6)                0.257         0.017     0.056
CK                      (189, 187.6)         (17, 18.4)                0.579         0.479     0.637
Chlorine                (97, 97)             (2.4, 2.4)                0.027         0.371     0.170
Alkaline phosphatase    (284, 283.7)         (21.5, 30.8)              0.149         0.000     0.000
Phosphorous             (5.3, 5.3)           (0.4, 0.3)                0.027         0.000     0.000
Iron                    (96.7, 98.5)         (8.6, 10.1)               0.534         0.182     0.748
LDH                     (418.9, 418.9)       (46.9, 46.9)              0.563         0.998     0.249
Magnesium               (2.2, 2.3)           (0.1, 0.2)                0.682         0.088     0.017
Sodium                  (141, 142)           (3.5, 3.3)                0.627         0.004     0.046
Potassium               (3.9, 3.9)           (0.2, 0.1)                0.176         0.534     0.000
HDL cholesterol         (53.7, 52.2)         (4.1, 7.5)                0.362         0.041     0.000
Amylase                 (108, 99.6)          (8.0, 11.5)               0.571         0.000     0.000
GGT                     (49.0, 47.7)         (3.5, 4.8)                0.374         0.000     0.000
Direct bilirubin        (1.1, 1.1)           (0.2, 0.2)                0.037         0.993     0.123
Ionized Calcium         (1, 1)               (0.08, 0.08)              0.392         0.942     0.564

$x_{pt}$ and $\sigma_{pt}$ correspond to the reference values. A significance level $\alpha = 5\,\%$ is used in all cases

Table 5 P-values obtained from three hypothesis tests (Lilliefors, one-sample t test and $\chi^2$ test, respectively) based on data reported by participants in one round of a proficiency test carried out in 2018

Analyte                 ($x_{pt}$, $x^*$)    ($\sigma_{pt}$, $s^*$)    Lilliefors    t test    $\chi^2$ test
Uric acid               (9.36, 9.52)         (0.41, 0.73)              0.124         0.002     0.000
Albumin                 (2.98, 3.03)         (0.15, 0.25)              0.017         0.033     0.000
Alat                    (125, 119)           (8.3, 14.0)               0.072         0.000     0.000
Asat                    (157, 150.7)         (10.67, 14.73)            0.229         0.000     0.000
Total bilirubin         (5.28, 5.25)         (0.37, 0.59)              0.073         0.684     0.000
Cholesterol             (262, 261.14)        (11, 14.26)               0.039         0.318     0.004
Creatinine              (3.85, 3.87)         (0.26, 0.30)              0.019         0.294     0.844
Glucose                 (272, 268)           (13.3, 16.6)              0.209         0.007     0.114
Ureic nitrogen          (52.8, 51.8)         (2.63, 4.96)              0.176         0.020     0.000
Total proteins          (4.78, 4.8)          (0.32, 0.36)              0.332         0.656     0.912
Triglycerides           (240, 242.9)         (13, 14.6)                0.000         0.000     0.837
Calcium                 (12.6, 12.54)        (0.43, 0.6)               0.002         0.350     0.005
CK                      (541, 527.3)         (32.3, 41.3)              0.623         0.003     0.126
Chlorine                (115, 115.6)         (3.3, 2.8)                0.018         0.094     0.009
Alkaline phosphatase    (524, 481)           (26, 48)                  0.318         0.000     0.000
Phosphorous             (7.16, 7.20)         (0.35, 0.57)              0.258         0.543     0.003
Iron                    (199, 201)           (12, 14.5)                0.041         0.362     0.495
LDH                     (721, 699)           (36, 79)                  0.199         0.045     0.000
Magnesium               (4.16, 4.0)          (0.16, 0.38)              0.011         0.014     0.000
Sodium                  (156, 155.4)         (2.67, 263)               0.360         0.061     0.185
Potassium               (5.99, 6.02)         (0.16, 0.18)              0.031         0.190     0.935
HDL cholesterol         (97.7, 101.2)        (4.8, 10.1)               0.014         0.001     0.000
Amylase                 (310, 312.5)         (15.3, 29)                0.335         0.411     0.000
GGT                     (139, 133)           (7, 17.8)                 0.082         0.063     0.000
Direct bilirubin        (1.64, 1.59)         (0.12, 0.32)              0.597         0.279     0.000
Ionized Calcium         (1.09, 1.09)         (0.01, 0.1)               0.510         0.976     0.153

$x_{pt}$ and $\sigma_{pt}$ correspond to the reference values. A significance level $\alpha = 5\,\%$ is used in all cases. The tests are based on pathological data

Conclusions

The analysis of a large dataset of PT in clinical chemistry, obtained by PROASECAL SAS in Colombia, based on reference values allows us to identify that laboratories have
in general suitable performances. The percentages of satisfactory performance are in some cases (several analytes) even greater than 90 %. However, a comparison of proficiency testing results obtained by consensus and reference values shows discrepancies between these approaches when the laboratories have unsatisfactory results. This is possibly because some statistical assumptions required to apply PT based on consensus values are violated. The statistical analysis carried out indicates that, for several analytes, there are differences between the mean values ($x^*$ and $x_{pt}$) and the uncertainty measures ($s^*$ and $\sigma_{pt}$) of these schemes. Also, in many cases, there is no statistical evidence of normality, which is a required assumption to apply a PT based on consensus. Many practical aspects not evaluated in this work, such as the training of analysts or the standardization of laboratory processes, could be generating these differences.

Acknowledgements We would like to thank the PROASECAL SAS company for providing us with the dataset analyzed in the article.

References

1. ISO/IEC 17043 (2010) Conformity assessment-general requirements for proficiency testing. Standard, International Organization for Standardization, Geneva
2. Medeiros de Albano F, Schwengber ten Caten C (2014) Proficiency tests for laboratories: a systematic review. Accred Qual Assur 19:245–257
3. Szewczak E, Bondarzewski A (2016) Is the assessment of interlaboratory comparison results for a small number of tests and limited number of participants reliable and rational? Accred Qual Assur 21(2):91–100
4. Koch M, Baumeister F (2012) On the use of consensus means as assigned values. Accred Qual Assur 17(4):395–398
5. PROASECAL SAS (2019) https://www.proasecal.com/
6. Hund E, Massart DL, Smeyers-Verbeke J (2000) Inter-laboratory studies in analytical chemistry. Anal Chim Acta 423(2):145–165
7. Wong S (2005) Evaluation of the use of consensus values in proficiency testing programmes. Accred Qual Assur 10(8):409–414
8. Baldan A, van der Veen AM, Prauß D, Recknagel A, Boley N, Evans S, Woods D (2001) Economy of proficiency testing: reference versus consensus values. Accred Qual Assur 6(4–5):164–167
9. Wong S (2007) A comparison of performance statistics for proficiency testing programmes. Accred Qual Assur 12:59–66
10. ISO 13528 (2015) Statistical methods for use in proficiency testing by interlaboratory comparison. Standard, International Organization for Standardization, Geneva
11. Hollander M, Wolfe DA (1999) Nonparametric statistical methods. Wiley, London
12. Zar J (1999) Biostatistical analysis. Pearson Education India, Bengaluru
13. Lalkhen AG, McCluskey A (2008) Clinical tests: sensitivity and specificity. Contin Educ Anaesth Crit Care Pain 8(6):221–223
14. Reitsma J, Glas A, Rutjes A, Scholten R, Bossuyt P, Zwinderman A (2005) Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J Clin Epidemiol 58:982–990
15. Kim S, Lee W (2017) Does McNemar’s test compare the sensitivities and specificities of two diagnostic tests? Stat Methods Med Res 26(1):142–154
16. Wong S (2016) Review of the new edition of ISO 13528. Accred Qual Assur 21(4):249–254
17. Willink R (2005) Forming a comparison reference value from different distributions of belief. Metrologia 43(1):12
18. Razali N, Wah Y (2011) Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests. J Stat Model Anal 2(1):21–33
19. Carobbi C (2017) A modified ISO 13528 robust analysis (algorithm A) that takes measurement uncertainty into account. Measurement 110:296–306

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.