https://doi.org/10.1210/clinem/dgae528
Advance access publication 31 July 2024
Meta-Analysis
Received: 19 April 2024. Editorial Decision: 28 July 2024. Corrected and Typeset: 26 August 2024
© The Author(s) 2024. Published by Oxford University Press on behalf of the Endocrine Society.
This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs licence (https://creativecommons.org/licenses/by-nc-nd/4.0/),
which permits non-commercial reproduction and distribution of the work, in any medium, provided the original work is not altered
or transformed in any way, and that the work is properly cited. For commercial re-use, please contact reprints@oup.com for reprints and translation rights for
reprints. All other permissions can be obtained through our RightsLink service via the Permissions link on the article page on our site—for further information
please contact journals.permissions@oup.com. See the journal About page for additional terms.
e2152 The Journal of Clinical Endocrinology & Metabolism, 2024, Vol. 109, No. 11
Abstract
Background: Establishing local trimester-specific reference intervals for gestational TSH and free T4 (FT4) is often not feasible, necessitating
alternative strategies. We aimed to systematically quantify the diagnostic performance of standardized modifications of center-specific
nonpregnancy reference intervals as compared to trimester-specific reference intervals.
Methods: We included prospective cohorts participating in the Consortium on Thyroid and Pregnancy. After relevant exclusions, reference
intervals were calculated per cohort in thyroperoxidase antibody-negative women. Modifications to the nonpregnancy reference intervals
included an absolute modification (per 0.1 mU/L TSH or 1 pmol/L free T4), relative modification (in steps of 5%) and fixed limits (upper TSH
limit between 3.0 and 4.5 mU/L and lower FT4 limit 5-15 pmol/L). We compared (sub)clinical hypothyroidism prevalence, sensitivity, and
positive predictive value (PPV) of these methodologies with population-based trimester-specific reference intervals.
Results: The final study population comprised 52 496 participants in 18 cohorts. Optimal modifications of standard reference intervals to
diagnose gestational overt hypothyroidism were −5% for the upper limit of TSH and +5% for the lower limit of FT4 (sensitivity, 0.70; CI, 0.47-0.86;
PPV, 0.64; CI, 0.54-0.74). For subclinical hypothyroidism, these were −20% for the upper limit of TSH and −15% for the lower limit of
FT4 (sensitivity, 0.91; CI, 0.67-0.98; PPV, 0.71; CI, 0.58-0.80). Absolute and fixed modifications yielded similar results. CIs were wide, limiting
generalizability.
Conclusion: We could not identify modifications of nonpregnancy TSH and FT4 reference intervals that would enable centers to adequately
Thyroid dysfunction during pregnancy is associated with a higher risk of miscarriage, preeclampsia, preterm birth, aberrant birthweight, and lower offspring IQ (1-6). Current international guidelines recommend defining gestational thyroid dysfunction according to population and pregnancy-specific TSH and free T4 (FT4) reference intervals, to take into account thyroid physiology during pregnancy, as well as differences in TSH and FT4 determinants between populations and the use of different laboratory assays (7-9). However, calculating such local reference intervals is generally not feasible for most centers (10, 11). In addition to the practical hurdles, most of the published reference intervals for TSH and FT4 are not in accordance with the current American Thyroid Association guidelines, as we recently showed in an overview of published TSH and FT4 reference intervals and methodologies: most included studies used additional exclusion criteria based on health status, did not exclude TPOAb-positive participants, or used different percentile cutoffs (8). This is in part because of changing guidelines and in part because many centers use additional exclusion criteria or apply different reference limit cutoffs (8). These varying methodologies hamper the adoption of reference intervals from other centers, and as such, the vast majority of centers rely on nonpregnancy reference intervals for TSH with either a fixed limit approach (upper limit of 4.0 mU/L for TSH) or a subtraction approach (subtraction of 0.5 mU/L from the upper limit of TSH), whereas for FT4, varying local approaches are used, including nonpregnancy reference intervals (12-14). These second-tier strategies are considered inferior to locally defined reference intervals (15-17). In a follow-up study, we showed that the use of a fixed upper TSH limit or the subtraction approach results in poor detection rates and high false-positive rates for (subclinical) hypothyroidism in early pregnancy with highly variable diagnostic performance between populations (sensitivity, 0.63-0.82; false discovery rate, 0.11-0.35) (18).

In search of a method that is both easy to implement in clinical practice and would better identify women with an abnormal thyroid function during pregnancy, we set out to investigate if it is possible to modify the center-specific nonpregnancy TSH and FT4 reference intervals so that these are useful in pregnancy. Such an approach could make the establishment of local pregnancy-specific reference intervals obsolete while it takes account of the local assay and preexisting laboratory harmonization efforts (19, 20). A useful diagnostic approach would need to fulfill certain conditions: (1) the diagnostic performance should be at least better than that of currently recommended alternative methods (TSH upper limit of 4.0 mU/L or subtraction of 0.5 mU/L) (12, 13) and (2) the diagnostic performance should be reasonably consistent between populations.

In this individual participant meta-analysis, we aimed to modify the center-specific nonpregnancy reference intervals of TSH and FT4 in a standardized manner and study the sensitivity and the positive predictive value (PPV) compared to center-specific gestational reference intervals as calculated in accordance with the current international guidelines.

Methods
The study inclusion and eligibility procedures are described in detail previously (18). In short, eligible studies were those participating in the Consortium on Thyroid and Pregnancy (https://www.consortiumthyroidpregnancy.org). Exclusion criteria for participants were prepregnancy thyroid disease, pregnancy through in vitro fertilization/intracytoplasmic sperm injection, use of thyroid (interfering) medication, and multiple gestation. For this study, we followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines for Individual Patient Data and preregistered the study protocol (CRD42021270078), which can be found in the supplemental materials along with an outline of protocol deviations (21). Study quality and risk of bias were assessed using the Newcastle-Ottawa scale (Supplementary materials (21)). All cohorts were approved by a local review board and acquired participant informed consent or had been granted exemption from it by the local ethics committee.

Defining Gestational Thyroid Dysfunction
Nonpregnancy reference intervals were either published and/or provided by the principal investigator of the included cohorts and are assay-specific. We defined the trimesters as 0 to 13 weeks, >13 to 27 weeks, and >27 weeks of gestation. For cohorts containing participants with repeated measurements, we used the first available sample for each trimester.
Reference intervals, thyroid dysfunction (overt and subclinical hypothyroidism), and diagnostic test properties were calculated separately for each cohort to account for inter-population differences. All reference intervals were calculated as the 2.5th to 97.5th percentiles in TPOAb-negative participants. Our primary aim was to optimize the diagnosis of thyroid dysfunction states for which treatment is indicated or should be considered based on current guidelines, and thus we limited analyses to overt and subclinical hypothyroidism (13). A treatment indication was defined as either (1) overt hypothyroidism, (2) subclinical hypothyroidism with TSH > 10 mU/L, or (3) subclinical hypothyroidism with TPOAb positivity. A treatment consideration was defined as (1) TSH between 2.5 mU/L and the upper reference limit with con

values whereas CIs give an indication of where the mean value lies. To facilitate comparison of diagnostic performance markers between methods, interactive heatmaps were constructed and can be found online (25).

Statistical Analyses
Diagnostic performance measures were calculated using 2 × 2 contingency tables (confusion matrices) per cohort and pooled using random intercept logistic regression models using maximum likelihood for modeling between-study heterogeneity. This approach was chosen because it outperforms conventional 2-step inverse-variance approaches for sparse event datasets (26, 27). For each alternative approach, the sensitivity, PPV,
Figure 2. Diagnostic performance of modified nonpregnancy reference intervals for overt hypothyroidism using relative modification. Diagnostic
performance for relative modifications of nonpregnancy reference intervals for the diagnosis of overt hypothyroidism, presented as F-scores. The
zoomed-in section presents additional diagnostic performance markers for selected modifications, of which an interactive version can be found online
(https://www.consortiumthyroidpregnancy.org/heatmaps).
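As a minimal sketch of how the per-cohort diagnostic performance markers shown in Figure 2 could be derived from a 2 × 2 contingency table, the Python snippet below computes sensitivity, PPV, and an F-score. It assumes the F-score is the conventional F1 (the harmonic mean of sensitivity and PPV), which this text does not spell out; the names `Confusion`, `confusion`, `sensitivity`, `ppv`, and `f_score` and all counts are hypothetical. It covers a single cohort only and does not reproduce the random intercept logistic regression used for pooling, so pooled estimates need not satisfy the same identity.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Confusion:
    """2 x 2 table of an alternative rule against the reference classification."""
    tp: int  # abnormal by both the reference and the alternative rule
    fp: int  # abnormal by the alternative rule only
    fn: int  # abnormal by the reference only
    tn: int  # normal by both


def confusion(gold: Sequence[bool], alt: Sequence[bool]) -> Confusion:
    """Cross-tabulate paired classifications (True = hypothyroidism)."""
    if len(gold) != len(alt):
        raise ValueError("gold and alt must have the same length")
    pairs = list(zip(gold, alt))
    return Confusion(
        tp=sum(1 for g, a in pairs if g and a),
        fp=sum(1 for g, a in pairs if not g and a),
        fn=sum(1 for g, a in pairs if g and not a),
        tn=sum(1 for g, a in pairs if not g and not a),
    )


def sensitivity(c: Confusion) -> float:
    """Share of reference-positive cases that the alternative rule detects."""
    denom = c.tp + c.fn
    return c.tp / denom if denom else math.nan


def ppv(c: Confusion) -> float:
    """Share of alternative-rule positives that are reference-positive."""
    denom = c.tp + c.fp
    return c.tp / denom if denom else math.nan


def f_score(c: Confusion) -> float:
    """F1: harmonic mean of sensitivity and PPV (assumed definition)."""
    s, p = sensitivity(c), ppv(c)
    if math.isnan(s) or math.isnan(p) or s + p == 0:
        return math.nan
    return 2 * s * p / (s + p)


if __name__ == "__main__":
    # Hypothetical counts for one cohort; not data from the study.
    cohort = Confusion(tp=30, fp=10, fn=10, tn=950)
    print(sensitivity(cohort), ppv(cohort), f_score(cohort))
```

Because the F1 ignores true negatives, it suits a setting like this one in which cases are rare and most participants are classified as normal by every method.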
by supplemental tables (21) containing the diagnostic performance markers for each specific combination (Fig. 2 is an explanatory example of the diagnostic markers presented). To facilitate comparison of diagnostic performance measures, an interactive version of the heatmaps including other diagnostic performance measures can be found online and is also referred to throughout, as an alternative to the supplemental tables (21) (https://www.consortiumthyroidpregnancy.org/heatmaps (25)).

Diagnostic Performance of Alternative Approaches
Using the relative modification approach in the first trimester, the highest F-scores for overt hypothyroidism were achieved with a relative subtraction of 5% for the upper reference limit of TSH and a relative addition of 5% for the lower reference limit of FT4 (F-score, 0.65; Fig. 3A). The associated sensitivity was 0.70 (95% CI, 0.47-0.86; 95% prediction interval [PI], 0.06-0.99; I² 64%), and the PPV was 0.64 (CI, 0.54-0.74; PI, 0.18-0.94; I² 45%; Fig. 3A, Supplementary Table 7 (21), Interactive figures (25)). For subclinical hypothyroidism, the highest F-scores were achieved with a relative subtraction of 20% for the upper reference limit of TSH and a relative subtraction of 15% for the lower reference limit of FT4 (F-score, 0.69; Fig. 3B). Associated sensitivity was 0.91 (CI, 0.67-0.98; PI, 0.02-1.00; I² 95%) and PPV was 0.71 (CI, 0.58-0.80; PI, 0.20-0.96; I² 95%; Supplementary Table 8 (21), Interactive figures (25)).

Using the absolute modification approach in the first trimester, the highest F-scores for overt hypothyroidism were achieved with a subtraction of either 0.1, 0.2, or 0.3 mU/L from the upper limit of TSH and an addition of 1 pmol/L to the lower limit of FT4 (F-score, 0.62; Fig. 3C). Associated sensitivity (for upper limit TSH, −0.2 mU/L) was 0.74 (CI, 0.52-0.89; PI, 0.08-0.99; I² 66%) and PPV was 0.57 (CI, 0.45-0.68; PI, 0.24-0.84; I² 39%; Supplementary Table 9 (21), Interactive figures (25)). For subclinical hypothyroidism, the highest F-scores were achieved with a subtraction of 0.8 mU/L from the upper limit of TSH and a subtraction of either 1, 2, 3, 4, or 5 pmol/L from the lower limit of FT4 (F-score, 0.64; Fig. 3D). Associated sensitivity (for lower limit FT4, −4 pmol/L) was 0.91 (CI, 0.61-0.98; PI, 0.01-1.00; I² 95%) and PPV was 0.68 (CI, 0.55-0.78; PI, 0.20-0.95; I² 95%; Supplementary Table 10 (21), Interactive figures (25)).

Using the fixed-limit approach in the first trimester, the highest F-scores for overt hypothyroidism were achieved with an upper limit of TSH of either 3.8, 3.9, 4.0, 4.1, or 4.4 mU/L and a lower limit of FT4 of 12 pmol/L (F-score, 0.65; Fig. 3E). Associated sensitivity (for upper limit TSH, 4.0 mU/L) was 0.83 (CI, 0.70-0.91; PI, 0.41-0.97; I² 0%) and PPV was 0.50 (CI, 0.32-0.68; PI, 0.05-0.95; I² 70%; Supplementary Table 11 (21), Interactive figures (25)). For subclinical hypothyroidism, the highest F-scores were achieved with an upper limit of TSH of 3.2 mU/L and a lower limit of FT4 of either 5, 6, 7, or 8 pmol/L (F-score, 0.70; Fig. 3F). Associated sensitivity (for lower limit FT4, 8 pmol/L) was 0.99 (CI, 0.88-1.00; PI, 0.03-1.00; I² 91%) and PPV was 0.66 (CI, 0.51-0.79; PI, 0.11-0.97; I² 96%; Supplementary Table 12 (21), Interactive figures (25)).

Additional Analyses
In the second trimester, maximum F-scores were similar for the relative modification method, the absolute modification approach, and the fixed-limit approach (Supplementary Fig. S1A-F (21)). However, comparing the diagnostic performance measures of individual studies, the variability between studies was very high, as reflected by overlapping CIs for all methods, wide prediction intervals, and high I² statistics for higher F-scores (Supplementary Tables 13-18 (21)). The diagnostic performance of alternative methods to detect women for whom levothyroxine treatment is indicated and those for whom treatment should be considered, according to American Thyroid Association guidelines, in the first trimester and second trimester were similar based on overlapping CIs (Supplementary Figs. S2 and S3; Supplementary Tables 19-30 (21)).

Discussion
In this study, we systematically evaluated multiple standardized procedures to modify nonpregnancy TSH and FT4
Figure 3. Diagnostic performance of modified nonpregnancy reference intervals for overt and subclinical hypothyroidism. Diagnostic performance of
modified nonpregnancy reference intervals is presented using relative modifications (A, B), absolute modifications (C, D), and fixed limits (E, F) for
overt and subclinical hypothyroidism, respectively, of which an interactive version can be found online (https://www.consortiumthyroidpregnancy.org/heatmaps).
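The three modification strategies compared in Figure 3 can be illustrated with a short Python sketch. It assumes hypothetical nonpregnancy limits (TSH upper limit 4.0 mU/L, FT4 lower limit 11.0 pmol/L) and applies the conventional definitions of overt hypothyroidism (TSH above its upper limit with FT4 below its lower limit) and subclinical hypothyroidism (TSH above its upper limit with FT4 not below its lower limit). Those definitions are an assumption here, and the upper FT4 limit is ignored for brevity. The names `Limits`, `relative`, `absolute`, `fixed`, and `classify` are illustrative and not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    tsh_upper: float  # upper reference limit of TSH, mU/L
    ft4_lower: float  # lower reference limit of FT4, pmol/L


def relative(base: Limits, tsh_pct: float, ft4_pct: float) -> Limits:
    """Shift each nonpregnancy limit by a percentage (e.g. -5 means minus 5%)."""
    return Limits(
        tsh_upper=base.tsh_upper * (1 + tsh_pct / 100),
        ft4_lower=base.ft4_lower * (1 + ft4_pct / 100),
    )


def absolute(base: Limits, tsh_delta: float, ft4_delta: float) -> Limits:
    """Shift each nonpregnancy limit by an absolute amount in its own unit."""
    return Limits(
        tsh_upper=base.tsh_upper + tsh_delta,
        ft4_lower=base.ft4_lower + ft4_delta,
    )


def fixed(tsh_upper: float, ft4_lower: float) -> Limits:
    """Ignore the nonpregnancy limits and use fixed cutoffs instead."""
    return Limits(tsh_upper=tsh_upper, ft4_lower=ft4_lower)


def classify(tsh: float, ft4: float, limits: Limits) -> str:
    """Assumed conventional definitions; the upper FT4 limit is not checked."""
    if tsh > limits.tsh_upper:
        return "overt" if ft4 < limits.ft4_lower else "subclinical"
    return "none"


if __name__ == "__main__":
    nonpregnancy = Limits(tsh_upper=4.0, ft4_lower=11.0)  # hypothetical assay limits
    candidates = {
        "relative (-5% TSH, +5% FT4)": relative(nonpregnancy, -5, 5),
        "absolute (-0.2 mU/L, +1 pmol/L)": absolute(nonpregnancy, -0.2, 1.0),
        "fixed (4.0 mU/L, 12 pmol/L)": fixed(4.0, 12.0),
    }
    for name, limits in candidates.items():
        # One hypothetical participant: TSH 3.9 mU/L, FT4 11.8 pmol/L.
        print(name, limits, classify(3.9, 11.8, limits))
```

A relative modification scales with the local assay's own limits, whereas an absolute or fixed modification applies the same numeric shift or cutoff to every assay.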
reference intervals with the aim of diagnosing the same individuals as having an abnormal gestational thyroid function in line with the “gold-standard” approach of center-specific and trimester-specific reference intervals. Despite our efforts, we were unable to identify a standardized procedure that achieved a satisfactory balance between sensitivity and PPV for gestational thyroid dysfunction without considerable variability across different populations. These results underscore the inherent challenge in balancing precise identification of gestational thyroid dysfunction with the practical limitations
of applying these diagnostic strategies universally in clinical settings and indicate that calculating local center and pregnancy-specific reference intervals for TSH and FT4 should still be considered as current best practice.

Current recommendations on gestational reference interval definitions for TSH and FT4 are time and resource consuming and are not feasible for most centers worldwide. The modification of nonpregnancy reference intervals for the use in pregnancy could overcome feasibility problems. However, in the current study, we show that the variability in TSH and FT4 distributions leads to unacceptable variation in diagnostic performance between cohorts. A possible explanation for this variation is that even the nonpregnancy TSH and FT4 reference intervals are not an adequate reflection of the distribution

diagnosing overt hypothyroidism, an entity with an evident treatment indication, is generally prioritized in diagnostic strategies for gestational thyroid dysfunction. However, failing to identify the more prevalent subclinical disease could also lead to decreased benefits of (selective) screening.

Although we found no method with an agreeable tradeoff in terms of diagnostic performance, it is important to realize that the interpretation of diagnostic performance of a test depends on the prior probability of disease (34). This is a highly relevant concept when thinking about differences between generalized population screening (with a low prior probability) vs high-risk case-based screening (with higher prior probabilities). For example, for a hypothetical diagnostic test with a sensitivity of 0.75 and a specificity of 0.99 (roughly equal to
study, our group will aim to establish risk-based decision limits.

In this study, we were able to leverage a large international dataset of multiple population-based prospective cohort studies to assess novel strategies for diagnosing thyroid dysfunction in pregnancy. The interpretation of the results of this study is limited to populations with iodine sufficiency or mild-to-moderate iodine deficiency because studies with excessive iodine status were excluded and no studies were performed in an area of severe iodine deficiency. Additionally, multiple differences between the included study populations, including differences in iodine supplementation, assays, and determinants of thyroid function tests, could have contributed to the variability in diagnostic performance of the nonpregnancy

Development. L.C. received travel support from Pfizer. S.M.N. has received consultancy, speakers’ fees, or travel support from Access Fertility, Beckman Coulter, Ferring Pharmaceuticals, Merck, Modern Fertility, Roche Diagnostics, and The Fertility Partnership. S.M.N. also reports payments for medical–legal work and investment in The Fertility Partnership. T.I.M.K. reports lectureship fees from Berlin-Chemie, Goodlife Healthcare, Institut Biochimique SA, Merck, and Quidel. U.F.R.’s research salary was sponsored by an unrestricted grant from Kirsten and Freddy Johansen’s Fund and U.F.R. reports lecture fees from Merck, Darmstadt. S.B.’s research salary was sponsored by the Capital Region of Denmark’s Research Foundation and the Novo Nordisk Foundation (ID 0077221). S.B. re
12. Lazarus J, Brown RS, Daumerie C, Hubalewska-Dydejczyk A, Negro R, Vaidya B. 2014 European thyroid association guidelines for the management of subclinical hypothyroidism in pregnancy and in children. Eur Thyroid J. 2014;3(2):76-94.
13. Alexander EK, Pearce EN, Brent GA, et al. 2017 guidelines of the American thyroid association for the diagnosis and management of thyroid disease during pregnancy and the postpartum. Thyroid. 2017;27(3):315-389.
14. Thyroid disease in pregnancy: ACOG practice bulletin, number 223. Obstet Gynecol. 2020;135(6):e261-e274.
15. Bliddal S, Feldt-Rasmussen U, Boas M, et al. Gestational age-specific reference ranges from different laboratories misclassify pregnant women’s thyroid status: comparison of two longitudinal prospective cohort studies. Eur J Endocrinol. 2014;170(2):329-339.
26. Lin L, Chu H. Meta-analysis of proportions using generalized linear mixed models. Epidemiology. 2020;31(5):713-717.
27. Stijnen T, Hamza TH, Ozdemir P. Random effects meta-analysis of event outcome in the framework of the generalized linear mixed model with applications in sparse data. Stat Med. 2010;29(29):3046-3067.
28. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; 2022.
29. Balduzzi S, Rucker G, Schwarzer G. How to perform a meta-analysis with R: a practical tutorial. Evid Based Ment Health. 2019;22(4):153-160.
30. Wickham H. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York; 2009.
31. Galili T, O’Callaghan A, Sidi J, Sievert C. Heatmaply: an R package for creating interactive cluster heatmaps for online publishing.