The Journal of Clinical Endocrinology & Metabolism, 2024, 109, e2151–e2158
https://doi.org/10.1210/clinem/dgae528
Advance access publication 31 July 2024
Meta-Analysis
Defining Gestational Thyroid Dysfunction Through Modified Nonpregnancy Reference Intervals: An Individual Participant Meta-analysis
Joris A. J. Osinga,1,2 Scott M. Nelson,3 John P. Walsh,4,5 Ghalia Ashoor,6 Glenn E. Palomaki,7
Abel López-Bermejo,8,9 Judit Bassols,10 Ashraf Aminorroaya,11 Maarten A. C. Broeren,12
Liangmiao Chen,13 Xuemian Lu,13 Suzanne J. Brown,4 Flora Veltri,14 Kun Huang,15
Tuija Männistö,16 Marina Vafeiadi,17 Peter N. Taylor,18 Fang-Biao Tao,15 Lida Chatzi,19
Maryam Kianpour,11 Eila Suvanto,20 Elena N. Grineva,21 Kypros H. Nicolaides,22 Mary E. D’Alton,23
Kris G. Poppe,14 Erik Alexander,24 Ulla Feldt-Rasmussen,25,26 Sofie Bliddal,25,26
Polina V. Popova,21 Layal Chaker,1,2,27 W. Edward Visser,1,2 Robin P. Peeters,1,2
Arash Derakhshan,1,2 Tanja G. M. Vrijkotte,28 Victor J. M. Pop,29 and Tim I. M. Korevaar1,2
1 Department of Internal Medicine, Erasmus University Medical Center, 3000 CA Rotterdam, the Netherlands
2 Academic Center for Thyroid Diseases, Erasmus University Medical Center, 3000 CA Rotterdam, the Netherlands
3 School of Medicine, Dentistry and Nursing, University of Glasgow, G12 8QQ Glasgow, UK
4 Department of Endocrinology and Diabetes, Sir Charles Gairdner Hospital, Nedlands, WA 6009, Australia
5 Medical School, University of Western Australia, Crawley, WA 6009, Australia
6 Harris Birthright Research Center for Fetal Medicine, King’s College Hospital, SE5 9RS London, UK
7 Department of Pathology and Laboratory Medicine, Women & Infants Hospital and Alpert Medical School at Brown University, RI 02903 Providence, USA
8 Pediatric Endocrinology Research Group, Girona Biomedical Research Institute (IDIBGI), Dr. Josep Trueta Hospital, 17007 Girona, Spain
9 Departament de Ciències Mèdiques, Universitat de Girona, 17003 Girona, Spain
10 Maternal-Fetal Metabolic Research Group, Girona Biomedical Research Institute (IDIBGI), Dr. Josep Trueta Hospital, 17007 Girona, Spain
11 Isfahan Endocrine and Metabolism Research Center, Isfahan University of Medical Sciences, 81745-33871 Isfahan, Iran
12 Laboratory of Clinical Chemistry and Haematology, Máxima Medical Centre, 5504 DB Veldhoven, Netherlands
13 Department of Endocrinology and Rui’an Center of the Chinese-American Research Institute for Diabetic Complications, Third Affiliated Hospital of Wenzhou Medical University, 325035 Wenzhou, China
14 Endocrine Unit, Centre Hospitalier Universitaire Saint-Pierre, Université Libre de Bruxelles (ULB), 1000 Brussels, Belgium
15 Department of Maternal, Child and Adolescent Health, Scientific Research Center in Preventive Medicine, School of Public Health, Anhui Medical University, 230032 Anhui, China
16 NordLab, Oulu and Translational Medicine Research Unit, University of Oulu, 90570 Oulu, Finland
17 Department of Social Medicine, School of Medicine, University of Crete, 710 03 Heraklion, Crete, Greece
18 Thyroid Research Group, Systems Immunity Research Institute, Cardiff University School of Medicine, CF10 3EU Cardiff, UK
19 Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA 90089, USA
20 Department of Obstetrics and Gynecology and Medical Research Center Oulu, University of Oulu, 90570 Oulu, Finland
21 Institute of Endocrinology, Almazov National Medical Research Centre, 197341 Saint Petersburg, Russia
22 Department of Women and Children’s Health, Faculty of Life Sciences and Medicine, King’s College London, SE5 9RS London, UK
23 Department of Obstetrics and Gynecology, Columbia University Irving Medical Center, New York, NY 10032, USA
24 Division of Endocrinology, Hypertension and Diabetes, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, USA
25 Department of Medical Endocrinology and Metabolism, Copenhagen University Hospital, Rigshospitalet, 2100 Copenhagen, Denmark
26 Department of Clinical Medicine, Faculty of Health and clinical Sciences, Copenhagen University, 2100 Copenhagen, Denmark
27 Department of Epidemiology, Erasmus University Medical Center, 3000 CA Rotterdam, the Netherlands
28 Department of Public and Occupational Health, Amsterdam UMC, University of Amsterdam, Amsterdam Public Health Research Institute, 1081 HV Amsterdam, the Netherlands
29 Department of Medical and Clinical Psychology, Tilburg University, 5000 LE Tilburg, the Netherlands
Correspondence: Joris Osinga, MD, Generation R, Wytemaweg 80, 3000 CA Rotterdam, the Netherlands. Email: j.osinga@erasmusmc.nl.
Received: 19 April 2024. Editorial Decision: 28 July 2024. Corrected and Typeset: 26 August 2024
© The Author(s) 2024. Published by Oxford University Press on behalf of the Endocrine Society.
This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs licence (https://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial reproduction and distribution of the work, in any medium, provided the original work is not altered
or transformed in any way, and that the work is properly cited. For commercial re-use, please contact reprints@oup.com for reprints and translation rights for
reprints. All other permissions can be obtained through our RightsLink service via the Permissions link on the article page on our site—for further information
please contact journals.permissions@oup.com. See the journal About page for additional terms.
e2152 The Journal of Clinical Endocrinology & Metabolism, 2024, Vol. 109, No. 11
Abstract
Background: Establishing local trimester-specific reference intervals for gestational TSH and free T4 (FT4) is often not feasible, necessitating
alternative strategies. We aimed to systematically quantify the diagnostic performance of standardized modifications of center-specific
nonpregnancy reference intervals as compared to trimester-specific reference intervals.
Methods: We included prospective cohorts participating in the Consortium on Thyroid and Pregnancy. After relevant exclusions, reference
intervals were calculated per cohort in thyroperoxidase antibody-negative women. Modifications to the nonpregnancy reference intervals
included an absolute modification (per 0.1 mU/L TSH or 1 pmol/L FT4), relative modification (in steps of 5%) and fixed limits (upper TSH
limit between 3.0 and 4.5 mU/L and lower FT4 limit 5-15 pmol/L). We compared (sub)clinical hypothyroidism prevalence, sensitivity, and
positive predictive value (PPV) of these methodologies with population-based trimester-specific reference intervals.
Results: The final study population comprised 52 496 participants in 18 cohorts. Optimal modifications of standard reference intervals to
diagnose gestational overt hypothyroidism were −5% for the upper limit of TSH and +5% for the lower limit of FT4 (sensitivity, 0.70; CI, 0.47-0.86; PPV, 0.64; CI, 0.54-0.74). For subclinical hypothyroidism, these were −20% for the upper limit of TSH and −15% for the lower limit of
FT4 (sensitivity, 0.91; CI, 0.67-0.98; PPV, 0.71; CI, 0.58-0.80). Absolute and fixed modifications yielded similar results. CIs were wide, limiting
generalizability.
Conclusion: We could not identify modifications of nonpregnancy TSH and FT4 reference intervals that would enable centers to adequately
approximate trimester-specific reference intervals. Future efforts should be turned toward studying the meaningfulness of trimester-specific
reference intervals and risk-based decision limits.
Key Words: thyroid gland, thyroid function tests, reference values, pregnancy, thyrotropin, thyroxine
Abbreviations: FT4, free T4; PI, prediction interval; PPV, positive predictive value.
Thyroid dysfunction during pregnancy is associated with a higher risk of miscarriage, preeclampsia, preterm birth, aberrant birthweight, and lower offspring IQ (1-6). Current international guidelines recommend defining gestational thyroid dysfunction according to population and pregnancy-specific TSH and free T4 (FT4) reference intervals, to take into account thyroid physiology during pregnancy, as well as differences in TSH and FT4 determinants between populations and the use of different laboratory assays (7-9). However, calculating such local reference intervals is generally not feasible for most centers (10, 11). In addition to the practical hurdles, most of the published reference intervals for TSH and FT4 are not in accordance with the current American Thyroid Association guidelines, as we recently showed by providing an overview of published TSH and FT4 reference intervals and methodologies, showing that most studies included used additional exclusion criteria based on health status, did not exclude TPOAb positive participants or used different percentile cutoffs (8). This is in part because of changing guidelines and in part because many centers use additional exclusion criteria or apply different reference limit cutoffs (8). These varying methodologies hamper the adoption of reference intervals from other centers, and as such, the vast majority of centers rely on nonpregnancy reference intervals for TSH with either a fixed limit approach (upper limit of 4.0 mU/L for TSH) or a subtraction approach (subtraction of 0.5 mU/L from the upper limit of TSH), whereas for FT4, varying local approaches are used including nonpregnancy reference intervals (12-14). These second-tier strategies are considered inferior compared to locally defined reference intervals (15-17). In a follow-up study, we showed that the use of a fixed upper TSH limit or the subtraction approach results in poor detection rates and high false-positive rates for (subclinical) hypothyroidism in early pregnancy with highly variable diagnostic performance between populations (sensitivity, 0.63-0.82; false discovery rate, 0.11-0.35) (18).

In search of a method that is both easy to implement in clinical practice and would better identify women with an abnormal thyroid function during pregnancy, we set out to investigate if it is possible to modify the center-specific nonpregnancy TSH and FT4 reference intervals so that these are useful in pregnancy. Such an approach could make the establishment of local pregnancy-specific reference intervals obsolete while it takes account of the local assay and preexisting laboratory harmonization efforts (19, 20). A useful diagnostic approach would need to fulfill certain conditions: (1) the diagnostic performance should be at least better than that of currently recommended alternative methods (TSH upper limit of 4.0 mU/L or subtraction of 0.5 mU/L) (12, 13) and (2) the diagnostic performance should be reasonably consistent between populations.

In this individual participant meta-analysis, we aimed to modify the center-specific nonpregnancy reference intervals of TSH and FT4 in a standardized manner and study the sensitivity and the positive predictive value (PPV) compared to center-specific gestational reference intervals as calculated in accordance with the current international guidelines.

Methods

The study inclusion and eligibility procedures have been described in detail previously (18). In short, eligible studies were those participating in the Consortium on Thyroid and Pregnancy (https://www.consortiumthyroidpregnancy.org). Exclusion criteria for participants were prepregnancy thyroid disease, pregnancy through in vitro fertilization/intracytoplasmic sperm injection, use of thyroid (interfering) medication, and multiple gestation. For this study, we followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines for Individual Patient Data and preregistered the study protocol (CRD42021270078), which can be found in the supplemental materials along with an outline of protocol deviations (21). Study quality and risk of bias were assessed using the Newcastle-Ottawa scale (Supplementary materials (21)). All cohorts were approved by a local review board and acquired participant informed consent or had been granted exemption from it by the local ethics committee.

Defining Gestational Thyroid Dysfunction

Nonpregnancy reference intervals were either published and/or provided by the principal investigator of the included cohorts and are assay-specific. We defined the trimesters as 0 to 13 weeks, >13 to 27 weeks, and >27 weeks of gestation. For cohorts containing participants with repeated measurements, we used the first available sample for each trimester.
Reference intervals, thyroid dysfunction (overt and subclinical hypothyroidism), and diagnostic test properties were calculated separately for each cohort to account for inter-population differences. All reference intervals were calculated as the 2.5th to 97.5th percentiles in TPOAb-negative participants. Our primary aim was to optimize the diagnosis of thyroid dysfunction states for which treatment is indicated or should be considered based on current guidelines, and thus we limited analyses to overt and subclinical hypothyroidism (13). A treatment indication was defined as either (1) overt hypothyroidism, (2) subclinical hypothyroidism with TSH >10 mU/L, or (3) subclinical hypothyroidism with TPOAb positivity. A treatment consideration was defined as (1) TSH between 2.5 mU/L and the upper reference limit with concomitant TPOAb positivity or (2) subclinical hypothyroidism without TPOAb positivity (13). Treatment of hyperthyroidism was outside the scope of this study because gestational hyperthyroidism is often considered physiological and we do not have data available to differentiate between gestational transient thyrotoxicosis and Graves’ hyperthyroidism (13).

The prevalence of thyroid dysfunction and diagnostic performance measures were calculated according to several methods: (1) a relative modification of the nonpregnancy upper limit of TSH varying from −5% to −40% in steps of 5%, with modifications to the lower limit of FT4 varying from −20% to +20% in steps of 5% (relative modification approach); (2) a subtraction from the nonpregnancy upper limit of TSH varying from −0.1 to −1.0 mU/L, with modification of the nonpregnancy lower limit of FT4 varying from −5 to +5 pmol/L (−0.39 to +0.39 ng/dL; absolute modification approach); and (3) using fixed upper limits for TSH, varying from 3.0 to 4.5 mU/L, and fixed lower limits for FT4, varying from 5 to 15 pmol/L (0.39-1.17 ng/dL; fixed limit approach). The choice for the range of modifications was based on previous recommendations (eg, the fixed upper limit of 4.0 mU/L for TSH and 0.5 mU/L subtraction from this limit) and the optimal diagnostic performance in this study to keep the results organized. The results for each method were compared to the reference standard (trimester-specific reference intervals), as is currently advised in international guidelines (12, 13).

Diagnostic Performance Measures

The diagnostic performance of each assessed combination is described using the sensitivity (equivalent to true-positive rate, true positives among all with the disease according to the trimester-specific method) and the PPV (equivalent to 1-false discovery rate, true positives among all with a positive test result). Presenting the PPV, rather than the specificity, was preferred because the PPV is more informative with regard to false positives for outcomes with a low prevalence (22). The aim was to maximize both diagnostic performance markers, which poses a challenge because maximizing sensitivity and the PPV is often a tradeoff.

The primary outcome was a single diagnostic performance measure, the F-score (also referred to as F1-score), which is a combined measure of PPV (also referred to as “precision”) and sensitivity (also referred to as “recall”) (23). A higher F-score denotes a better overall diagnostic performance.

Prediction intervals and the I2 statistic are presented to illustrate the expected inter-population variation in diagnostic performance and between-study heterogeneity (21, 24). Prediction intervals are an attempt to predict future individual values whereas CIs give an indication of where the mean value lies. To facilitate comparison of diagnostic performance markers between methods, interactive heatmaps were constructed and can be found online (25).

Statistical Analyses

Diagnostic performance measures were calculated using 2 × 2 contingency tables (confusion matrices) per cohort and pooled using random intercept logistic regression models using maximum likelihood for modeling between-study heterogeneity. This approach was chosen because it outperforms conventional 2-step inverse-variance approaches for sparse event datasets (26, 27). For each alternative approach, the sensitivity, PPV, and F-scores were calculated and compared with the trimester-specific approach. All analyses were performed using R statistical software version 4.2.2 (28), specifically using the packages “meta” (29), “ggplot2” (30), and “heatmaply” (31).

Results

After exclusions, the final study population comprised 52 496 participants included in 18 cohorts (Fig. 1), of whom 8.6% were TPOAb positive (range across cohorts 5.7-17.1%; Supplementary Table 1 (21)). The prevalence of thyroid function test abnormalities (in the first and second trimester, respectively) according to the trimester-specific approach was 0.5% and 0.3% for overt hypothyroidism and 3.4% and 3.2% for subclinical hypothyroidism. The inclusion process and maternal demographics have been described in detail previously (18). Cohort-specific prevalence of thyroid disease, reference limits, iodine status, and assay information can be found in Supplementary Tables 2-6 (21). All figures are accompanied

Figure 1. Flowchart of included cohorts and participants. Reprinted from Osinga et al (18).
F-Score  Modification     Sensitivity                            Positive Predictive Value
         ULTSH   LLFT4    Value  CI         PI         I2        Value  CI         PI         I2
0.54     None    None     0.49   0.33-0.65  0.08-0.91  59%       0.65   0.45-0.81  0.13-0.96  26%
0.54     -5%     None     0.53   0.32-0.73  0.05-0.96  48%       0.65   0.46-0.81  0.15-0.95  17%
0.54     -10%    None     0.54   0.34-0.73  0.05-0.96  44%       0.63   0.44-0.79  0.14-0.95  30%
0.64     None    +5%      0.65   0.46-0.80  0.10-0.97  67%       0.65   0.54-0.74  0.17-0.94  48%
0.65     -5%     +5%      0.70   0.47-0.86  0.06-0.99  64%       0.64   0.54-0.74  0.18-0.94  45%
0.64     -10%    +5%      0.71   0.49-0.86  0.07-0.99  64%       0.62   0.46-0.76  0.17-0.93  55%
0.61     None    +10%     0.75   0.50-0.91  0.05-0.99  51%       0.56   0.45-0.67  0.25-0.83  35%
0.61     -5%     +10%     0.83   0.53-0.95  0.03-1.00  44%       0.55   0.44-0.66  0.25-0.82  33%
0.60     -10%    +10%     0.84   0.55-0.95  0.03-1.00  46%       0.53   0.42-0.65  0.22-0.82  39%
(CI, confidence interval; PI, prediction interval; I2, I2 statistic)
Figure 2. Diagnostic performance of modified nonpregnancy reference intervals for overt hypothyroidism using relative modification. Diagnostic
performance for relative modifications of nonpregnancy reference intervals for the diagnosis of overt hypothyroidism, presented as F-scores. The
zoomed-in section presents additional diagnostic performance markers for selected modifications, of which an interactive version can be found online
(https://www.consortiumthyroidpregnancy.org/heatmaps).
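As a minimal sketch of how the three measures shown in Figure 2 relate to each other, the Python snippet below computes sensitivity, PPV, and the F-score from the cells of a single 2 × 2 table, taking the trimester-specific classification as the reference standard. The function name and the counts are hypothetical and are not taken from the study. The pooled values in the figure come from random-intercept logistic regression across cohorts, so they need not equal the harmonic mean of the pooled sensitivity and PPV.

```python
import math


def diagnostic_measures(tp: int, fp: int, fn: int) -> dict:
    """Sensitivity, PPV and F1 for one cohort.

    tp: positive by both the modified interval and the reference standard
    fp: positive by the modified interval only
    fn: positive by the reference standard only
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else math.nan
    ppv = tp / (tp + fp) if (tp + fp) else math.nan
    if sensitivity + ppv > 0:  # False when either value is NaN
        f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    else:
        f1 = math.nan
    return {"sensitivity": sensitivity, "ppv": ppv, "f1": f1}


# Hypothetical counts for a single cohort (illustration only).
measures = diagnostic_measures(tp=14, fp=8, fn=6)
print({name: round(value, 3) for name, value in measures.items()})
```

Because the F-score is a harmonic mean, it stays low whenever either sensitivity or PPV is low, which is why it penalizes modifications that gain sensitivity only at the cost of many false positives.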
by supplemental tables (21) containing the diagnostic performance markers for each specific combination (Fig. 2 is an explanatory example of the diagnostic markers presented). To facilitate comparison of diagnostic performance measures, an interactive version of the heatmaps including other diagnostic performance measures can be found online and is also referred to throughout, as an alternative to the supplemental tables (21) (https://www.consortiumthyroidpregnancy.org/heatmaps (25)).

Diagnostic Performance of Alternative Approaches

Using the relative modification approach in the first trimester, the highest F-scores for overt hypothyroidism were achieved with a relative subtraction of 5% for the upper reference limit of TSH and a relative addition of 5% for the lower reference limit of FT4 (F-score, 0.65; Fig. 3A). The associated sensitivity was 0.70 (95% CI, 0.47-0.86; 95% prediction interval [PI], 0.06-0.99; I2 64%), and the PPV was 0.64 (CI, 0.54-0.74; PI, 0.18-0.94; I2 45%; Fig. 3A, Supplementary Table 7 (21), Interactive figures (25)). For subclinical hypothyroidism, the highest F-scores were achieved with a relative subtraction of 20% for the upper reference limit of TSH and a relative subtraction of 15% for the lower reference limit of FT4 (F-score, 0.69; Fig. 3B). Associated sensitivity was 0.91 (CI, 0.67-0.98; PI, 0.02-1.00; I2 95%) and PPV was 0.71 (CI, 0.58-0.80; PI, 0.20-0.96; I2 95%; Supplementary Table 8 (21), Interactive figures (25)).

Using the absolute modification approach in the first trimester, the highest F-scores for overt hypothyroidism were achieved with a subtraction of either −0.1, −0.2, or −0.3 mU/L for the upper limit of TSH and an addition of +1 pmol/L to the lower limit of FT4 (F-score, 0.62; Fig. 3C). Associated sensitivity (for upper limit TSH, −0.2 mU/L) was 0.74 (CI, 0.52-0.89; PI, 0.08-0.99; I2 66%) and PPV was 0.57 (CI, 0.45-0.68; PI, 0.24-0.84; I2 39%; Supplementary Table 9 (21), Interactive figures (25)). For subclinical hypothyroidism, the highest F-scores were achieved with a subtraction of −0.8 mU/L from the upper limit of TSH and a subtraction of either −1, −2, −3, −4, or −5 pmol/L from the lower limit of FT4 (F-score, 0.64; Fig. 3D). Associated sensitivity (for lower limit FT4, −4 pmol/L) was 0.91 (CI, 0.61-0.98; PI, 0.01-1.00; I2 95%) and PPV was 0.68 (CI, 0.55-0.78; PI, 0.20-0.95; I2 95%; Supplementary Table 10 (21), Interactive figures (25)).

Using the fixed-limit approach in the first trimester, the highest F-scores for overt hypothyroidism were achieved with an upper limit of TSH of either 3.8, 3.9, 4.0, 4.1, or 4.4 mU/L and a lower limit of FT4 of 12 pmol/L (F-score, 0.65; Fig. 3E). Associated sensitivity (for upper limit TSH, 4.0 mU/L) was 0.83 (CI, 0.70-0.91; PI, 0.41-0.97; I2 0%) and PPV was 0.50 (CI, 0.32-0.68; PI, 0.05-0.95; I2 70%; Supplementary Table 11 (21), Interactive figures (25)). For subclinical hypothyroidism, the highest F-scores were achieved with an upper limit of TSH of 3.2 mU/L and a lower limit of FT4 of either 5, 6, 7, or 8 pmol/L (F-score, 0.70; Fig. 3F). Associated sensitivity (for lower limit FT4, 8 pmol/L) was 0.99 (CI, 0.88-1.00; PI, 0.03-1.00; I2 91%) and PPV was 0.66 (CI, 0.51-0.79; PI, 0.11-0.97; I2 96%; Supplementary Table 12 (21), Interactive figures (25)).

Additional Analyses

In the second trimester, maximum F-scores were similar for the relative modification approach, the absolute modification approach, and the fixed-limit approach (Supplementary Fig. S1A-F (21)). However, comparing the diagnostic performance measures of individual studies, the variability between studies was very high, as reflected by overlapping CIs for all methods, wide prediction intervals, and high I2 statistics for higher F-scores (Supplementary Tables 13-18 (21)). The diagnostic performance of alternative methods to detect women for whom levothyroxine treatment is indicated and those for whom treatment should be considered, according to American Thyroid Association guidelines, was similar in the first and second trimester based on overlapping CIs (Supplementary Figs. S2 and S3; Supplementary Tables 19-30 (21)).

Discussion

In this study, we systematically evaluated multiple standardized procedures to modify nonpregnancy TSH and FT4
Figure 3. Diagnostic performance of modified nonpregnancy reference intervals for overt and subclinical hypothyroidism. Diagnostic performance of modified nonpregnancy reference intervals is presented using relative modifications (A, B), absolute modifications (C, D), and fixed limits (E, F) for overt and subclinical hypothyroidism, respectively. An interactive version can be found online (https://www.consortiumthyroidpregnancy.org/heatmaps).
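The three approaches compared in Figure 3 (relative modification, absolute modification, and fixed limits) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the study's code: the class and function names are hypothetical, the nonpregnancy limits of 4.0 mU/L and 12.0 pmol/L are invented for the example, and the classification rule assumes the conventional definitions (TSH above the upper limit with FT4 below the lower limit for overt hypothyroidism; TSH above the upper limit with FT4 not below the lower limit for subclinical hypothyroidism).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    tsh_upper: float  # upper reference limit of TSH, mU/L
    ft4_lower: float  # lower reference limit of FT4, pmol/L


def relative(base: Limits, tsh_pct: float, ft4_pct: float) -> Limits:
    """Relative modification: tsh_pct=-5 lowers the TSH limit by 5%."""
    return Limits(base.tsh_upper * (1 + tsh_pct / 100),
                  base.ft4_lower * (1 + ft4_pct / 100))


def absolute(base: Limits, tsh_delta: float, ft4_delta: float) -> Limits:
    """Absolute modification: tsh_delta=-0.5 subtracts 0.5 mU/L."""
    return Limits(base.tsh_upper + tsh_delta, base.ft4_lower + ft4_delta)


def fixed(tsh_upper: float, ft4_lower: float) -> Limits:
    """Fixed limits that ignore the nonpregnancy interval altogether."""
    return Limits(tsh_upper, ft4_lower)


def classify(tsh: float, ft4: float, limits: Limits) -> str:
    # Assumed conventional definitions; high FT4 is not handled here.
    if tsh <= limits.tsh_upper:
        return "no hypothyroidism"
    return "overt" if ft4 < limits.ft4_lower else "subclinical"


# Hypothetical assay-specific nonpregnancy limits (not from the study).
nonpregnancy = Limits(tsh_upper=4.0, ft4_lower=12.0)
modified = relative(nonpregnancy, tsh_pct=-5, ft4_pct=+5)

# One hypothetical measurement classified before and after modification.
before = classify(tsh=3.9, ft4=12.3, limits=nonpregnancy)
after = classify(tsh=3.9, ft4=12.3, limits=modified)
print(before, "->", after)
```

Comparing the labels produced under each set of modified limits with those produced under trimester-specific limits yields the 2 × 2 table from which sensitivity, PPV, and the F-score are derived.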
reference intervals with the aim of diagnosing the same individuals as having an abnormal gestational thyroid function in line with the “gold-standard” approach of center-specific and trimester-specific reference intervals. Despite our efforts, we were unable to identify a standardized procedure that achieved a satisfactory balance between sensitivity and PPV for gestational thyroid dysfunction without considerable variability across different populations. These results underscore the inherent challenge in balancing precise identification of gestational thyroid dysfunction with the practical limitations
of applying these diagnostic strategies universally in clinical settings and indicate that calculating local center and pregnancy-specific reference intervals for TSH and FT4 should still be considered as current best practice.

Current recommendations on gestational reference interval definitions for TSH and FT4 are time and resource consuming and are not feasible for most centers worldwide. The modification of nonpregnancy reference intervals for the use in pregnancy could overcome feasibility problems. However, in the current study, we show that the variability in TSH and FT4 distributions leads to unacceptable variation in diagnostic performance between cohorts. A possible explanation for this variation is that even the nonpregnancy TSH and FT4 reference intervals are not an adequate reflection of the distribution of thyroid function tests for a population if they are based on the manufacturer’s recommendation rather than local laboratory-specific establishment of the intervals. Methods for determining reference intervals in pregnancy and outside pregnancy often differ because current recommendations on establishing reference limits in pregnancy include the local population and are by definition a reflection of local TSH and FT4 distributions (12-14), whereas reference limits outside pregnancy are often supplied by the assay manufacturer, who mostly established reference intervals in selected, nonpregnant populations (32, 33). Global harmonization efforts for TSH and FT4 assays by the International Federation of Clinical Chemistry and Laboratory Medicine Committee for Standardization of Thyroid Function Tests are ongoing to address this issue outside of pregnancy, which could lead to an attenuation of this mismatch (19, 20).

We also show that for overt hypothyroidism and subclinical hypothyroidism, different and sometimes opposing modifications of the reference limits of TSH and FT4 were needed to achieve maximum diagnostic performance. For instance, when reviewing the relative modifications needed to achieve the best diagnostic performance for overt hypothyroidism in the first trimester, we find that the best F-score of 0.65 is achieved with the upper limit of TSH −5% and the lower limit of FT4 +5% (Fig. 3A), whereas the best F-score for subclinical hypothyroidism of 0.69 is achieved with the upper limit of TSH −20% and the lower limit of FT4 −15% (Fig. 3B). We previously showed that the use of trimester-specific reference intervals for FT4 is most important for the correct diagnosis of overt hypothyroidism, whereas for subclinical hypothyroidism, the use of trimester-specific reference intervals for TSH is more important (18), which could explain the current results. This finding suggests that a uniform rule established to diagnose both overt and subclinical disease would be good at diagnosing one at the cost of incorrectly diagnosing the other. We also observe that the trends in diagnostic performance for a treatment indication (Supplementary Fig. S2A, C, and E (21)) mostly overlap with the trend in diagnostic performance for subclinical hypothyroidism (Fig. 3B, D, and F). This is because most women with a treatment indication present with subclinical hypothyroidism with TPOAb positivity (73.6%) rather than overt hypothyroidism (25.4%) or subclinical hypothyroidism with TSH >10 mU/L (1.1%; data not shown). Because the prevalence of subclinical hypothyroidism is much higher than that of overt hypothyroidism, it can be expected that the best diagnostic performance of a test to detect a treatment indication is reached with the same modifications as for subclinical hypothyroidism. This concept is important for future recommendations on universal reference limits because diagnosing overt hypothyroidism, an entity with an evident treatment indication, is generally prioritized in diagnostic strategies for gestational thyroid dysfunction. However, failing to identify the more prevalent subclinical disease could also lead to decreased benefits of (selective) screening.

Although we found no method with an agreeable tradeoff in terms of diagnostic performance, it is important to realize that the interpretation of diagnostic performance of a test depends on the prior probability of disease (34). This is a highly relevant concept when thinking about differences between generalized population screening (with a low prior probability) vs high-risk case-based screening (with higher prior probabilities). For example, for a hypothetical diagnostic test with a sensitivity of 0.75 and a specificity of 0.99 (roughly equal to the tests assessed in our study), a pretest probability of 3% would result in a postpositive test probability (or PPV) of 70% and a false discovery rate of 30%. Using the same sensitivity and specificity, a pretest probability of 10% would result in a postpositive test probability of 89% with a false discovery rate of 11%. The current study population consists of population-based cohort studies as a reflection of the general population, which have a low prior probability of disease equal to the population prevalence and similar to a universal screening approach. One option to improve how alternative reference interval strategies could identify those with an abnormal thyroid function would be to increase the prior probability of disease (34). This can be achieved by optimizing the identification of high-risk subgroups and a risk-based screening approach, which could improve the accuracy of diagnostic strategies (35). Thus, the implementation of universal screening will be inherently associated with the lowest prior probability of disease and the highest rates of both over and underdiagnosis, especially if alternative strategies are used to define thyroid function test abnormalities.

The heterogeneity between populations (as denoted by wide prediction intervals and high I2 statistics) underlines that calculating local center and pregnancy-specific reference intervals for TSH and FT4 should still be considered as current best practice. However, other strategies for the improvement of the diagnosis of gestational thyroid dysfunction might prove more effective. The trimester-specific approach is currently accepted as the best diagnostic method for diagnosing thyroid dysfunction in pregnancy, but the pragmatic division of the gestational period in trimesters does not necessarily reflect the physiological changes of thyroid function tests during pregnancy (36-38). Further studies are needed to assess which gestational period reference intervals should be based on to optimally identify the women at increased risk of adverse events because of thyroid dysfunction, or if any form of standardization to gestational age should be abandoned altogether. Current reference interval definitions are based on outlying percentiles of TSH and FT4 distributions (2.5th and 97.5th percentiles); values above or below those cutoffs were later shown to be associated with adverse pregnancy outcomes (39). With increasing data availability in the literature, the ideal way to establish reference values would be to turn this methodology around and base the cutoffs on the risk of adverse outcomes, similar to other fields (40, 41). Obvious adverse pregnancy events would be those associated with thyroid function tests in previous studies such as preterm birth and offspring IQ scores (3, 4, 6). Because we did not identify an adequate or easily implementable methodology to approach trimester-specific reference intervals in the current
study, our group will aim to establish risk-based decision limits.

In this study, we were able to leverage a large international dataset of multiple population-based prospective cohort studies to assess novel strategies for diagnosing thyroid dysfunction in pregnancy. The interpretation of the results of this study is limited to populations with iodine sufficiency or mild-to-moderate iodine deficiency because studies with excessive iodine status were excluded and no studies were performed in an area of severe iodine deficiency. Additionally, multiple differences between the included study populations, including differences in iodine supplementation, assays, and determinants of thyroid function tests, could have contributed to the variability in diagnostic performance of the nonpregnancy reference interval adaptations assessed in this study. Adaptations of nonpregnancy reference limits could be more accurate in specific populations, which we were not able to assess with sufficient power. Nonetheless, this study reflects common practice because these factors naturally vary between populations. The results of the current study may not be optimally generalizable to present-day populations because the inclusion periods for the majority of included cohorts were between the years 2000 and 2015. It is likely that determinants of thyroid function and assay calibration standards have changed over time (42). It can, however, be expected that large inter-population differences, as demonstrated in this study, are still present to this day. Ongoing harmonization efforts by the International Federation of Clinical Chemistry and Laboratory Medicine could improve the diagnostic performance of alternative strategies, and future studies could assess whether a generalizable rule is more effective in cohorts established after the start of the harmonization efforts.

In conclusion, this is the first study to systematically quantify the diagnostic performance of standardized modifications of nonpregnancy TSH and FT4 reference intervals in pregnancy. We show that standardized modifications have poor overlap in diagnostic accuracy compared with cohort and trimester-specific reference intervals, resulting in considerable variation in diagnostic performance between populations. Future efforts should be turned toward studying the meaningfulness of trimester-specific, pregnancy-specific reference intervals and the establishment of risk-based decision limits.

Acknowledgments
The authors gratefully acknowledge all participants, general practitioners, hospitals, and midwives for their important contribution to the establishment of the cohorts and the resulting works. Acknowledgments for individual cohorts are listed in the supplemental materials (21).

Funding
Netherlands Organization for Scientific Research (grants 401.16.020 and 016.176.331) to R.P.P.

Disclosures
P.T. reports a travel grant from the Society for Endocrinology (leadership development award). E.N.G. received speaker’s fees and payment for expert testimony from Merck and consulting fees from Brunel Rus. T.G.M.V. reports grants from the Netherlands Organization for Health Research and Development. L.C. received travel support from Pfizer. S.M.N. has received consultancy, speakers’ fees, or travel support from Access Fertility, Beckman Coulter, Ferring Pharmaceuticals, Merck, Modern Fertility, Roche Diagnostics, and The Fertility Partnership. S.M.N. also reports payments for medical–legal work and investment in The Fertility Partnership. T.I.M.K. reports lectureship fees from Berlin-Chemie, Goodlife Healthcare, Institut Biochimique SA, Merck, and Quidel. U.F.R.’s research salary was sponsored by an unrestricted grant from Kirsten and Freddy Johansen’s Fund, and U.F.R. reports lecture fees from Merck, Darmstadt. S.B.’s research salary was sponsored by the Capital Region of Denmark’s Research Foundation and the Novo Nordisk Foundation (ID 0077221). S.B. received a lecture fee from Merck and Novo Nordisk. All other authors declare no competing interests.

Data Availability
The data that support the findings of this study are not publicly available due to local, national, and international restrictions aimed at protecting the privacy of research participants.

References
1. Derakhshan A, Peeters RP, Taylor PN, et al. Association of maternal thyroid function with birthweight: a systematic review and individual-participant data meta-analysis. Lancet Diabetes Endocrinol. 2020;8(6):501-510.
2. Toloza FJK, Derakhshan A, Mannisto T, et al. Association between maternal thyroid function and risk of gestational hypertension and pre-eclampsia: a systematic review and individual-participant data meta-analysis. Lancet Diabetes Endocrinol. 2022;10(4):243-252.
3. Levie D, Korevaar TIM, Bath SC, et al. Thyroid function in early pregnancy, child IQ, and autistic traits: a meta-analysis of individual participant data. J Clin Endocrinol Metab. 2018;103(8):2967-2979.
4. Thompson W, Russell G, Baragwanath G, Matthews J, Vaidya B, Thompson-Coon J. Maternal thyroid hormone insufficiency during pregnancy and risk of neurodevelopmental disorders in offspring: a systematic review and meta-analysis. Clin Endocrinol (Oxf). 2018;88(4):575-584.
5. Han Y, Gao X, Wang X, et al. A systematic review and meta-analysis examining the risk of adverse pregnancy and neonatal outcomes in women with isolated hypothyroxinemia in pregnancy. Thyroid. 2023;33(5):603-614.
6. Korevaar TIM, Derakhshan A, Taylor PN, et al. Association of thyroid function test abnormalities and thyroid autoimmunity with preterm birth: a systematic review and meta-analysis. JAMA. 2019;322(7):632-641.
7. Krassas GE, Poppe K, Glinoer D. Thyroid function and human reproductive health. Endocr Rev. 2010;31(5):702-755.
8. Osinga JAJ, Derakhshan A, Palomaki GE, et al. TSH and FT4 reference intervals in pregnancy: a systematic review and individual participant data meta-analysis. J Clin Endocrinol Metab. 2022;107(10):2925-2933.
9. Springer D, Bartos V, Zima T. Reference intervals for thyroid markers in early pregnancy determined by 7 different analytical systems. Scand J Clin Lab Invest. 2014;74(2):95-101.
10. Negro R, Attanasio R, Papini E, et al. A 2018 Italian and Romanian survey on subclinical hypothyroidism in pregnancy. Eur Thyroid J. 2018;7(6):294-301.
11. Toloza FJK, Ospina NMS, Rodriguez-Gutierrez R, et al. Practice variation in the care of subclinical hypothyroidism during pregnancy: a national survey of physicians in the United States. J Endocr Soc. 2019;3(10):1892-1906.
12. Lazarus J, Brown RS, Daumerie C, Hubalewska-Dydejczyk A, Negro R, Vaidya B. 2014 European thyroid association guidelines for the management of subclinical hypothyroidism in pregnancy and in children. Eur Thyroid J. 2014;3(2):76-94.
13. Alexander EK, Pearce EN, Brent GA, et al. 2017 guidelines of the American thyroid association for the diagnosis and management of thyroid disease during pregnancy and the postpartum. Thyroid. 2017;27(3):315-389.
14. Thyroid disease in pregnancy: ACOG practice bulletin, number 223. Obstet Gynecol. 2020;135(6):e261-e274.
15. Bliddal S, Feldt-Rasmussen U, Boas M, et al. Gestational age-specific reference ranges from different laboratories misclassify pregnant women’s thyroid status: comparison of two longitudinal prospective cohort studies. Eur J Endocrinol. 2014;170(2):329-339.
16. Liu J, Yu X, Xia M, et al. Development of gestation-specific reference intervals for thyroid hormones in normal pregnant northeast Chinese women: what is the rational division of gestation stages for establishing reference intervals for pregnancy women? Clin Biochem. 2017;50(6):309-317.
17. Mehran L, Amouzegar A, Delshad H, et al. Trimester-specific reference ranges for thyroid hormones in Iranian pregnant women article. J Thyroid Res. 2013;2013:651517.
18. Osinga JAJ, Derakhshan A, Feldt-Rasmussen U, et al. TSH and FT4 reference interval recommendations and prevalence of gestational thyroid dysfunction: quantification of current diagnostic approaches. J Clin Endocrinol Metab. 2024;109(3):868-878.
19. Thienpont LM, Van Uytfanghe K, De Grande LAC, et al. Harmonization of serum thyroid-stimulating hormone measurements paves the way for the adoption of a more uniform reference interval. Clin Chem. 2017;63(7):1248-1260.
20. Thienpont LM, Van Uytfanghe K, Van Houcke S, et al. A progress report of the IFCC committee for standardization of thyroid function tests. Eur Thyroid J. 2014;3(2):109-116.
21. Osinga JAJ, Derakhshan A, Korevaar TIM. Data from: reference intervals. Consortium on thyroid and pregnancy. Updated July 18, 2023. https://www.consortiumthyroidpregnancy.org/referenceintervals
22. Lutgendorf MA, Stoll KA. Why 99% may not be as good as you think it is: limitations of screening for rare diseases. J Matern Fetal Neonatal Med. 2016;29(7):1187-1189.
23. Goutte C, Gaussier E. A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation. Springer; 2005:345-359.
24. Riley RD, Higgins JP, Deeks JJ. Interpretation of random effects meta-analyses. BMJ. 2011;342:d549.
25. Osinga JAJ, Derakhshan A, Peeters RP, Korevaar TIM. Data from: Heatmaps. Consortium on Thyroid and Pregnancy. Updated January 31, 2024. Accessed January 31, 2024. https://www.consortiumthyroidpregnancy.org/heatmaps
26. Lin L, Chu H. Meta-analysis of proportions using generalized linear mixed models. Epidemiology. 2020;31(5):713-717.
27. Stijnen T, Hamza TH, Ozdemir P. Random effects meta-analysis of event outcome in the framework of the generalized linear mixed model with applications in sparse data. Stat Med. 2010;29(29):3046-3067.
28. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; 2022.
29. Balduzzi S, Rucker G, Schwarzer G. How to perform a meta-analysis with R: a practical tutorial. Evid Based Ment Health. 2019;22(4):153-160.
30. Wickham H. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York; 2009.
31. Galili T, O’Callaghan A, Sidi J, Sievert C. Heatmaply: an R package for creating interactive cluster heatmaps for online publishing. Bioinformatics. 2018;34(9):1600-1602.
32. Brochure. Roche Diagnostics GmbH. Reference Intervals for Children and Adults Elecsys Thyroid Tests 2009.
33. Brochure. Abbott Diagnostic Division. Architect system TSH ref 7K62. 2010.
34. Bours MJ. Bayes’ rule in diagnosis. J Clin Epidemiol. 2021;131:158-160.
35. Osinga JAJ, Liu Y, Männistö T, et al. Risk factors for thyroid dysfunction in pregnancy: an individual participant data meta-analysis. Thyroid. 2024;34(5):646-658.
36. Korevaar TIM, Medici M, Visser TJ, Peeters RP. Thyroid disease in pregnancy: new insights in diagnosis and clinical management. Nat Rev Endocrinol. 2017;13(10):610-622.
37. Glinoer D, de Nayer P, Bourdoux P, et al. Regulation of maternal thyroid during pregnancy. J Clin Endocrinol Metab. 1990;71(2):276-287.
38. Andersen SL, Andersen S, Carle A, et al. Pregnancy week-specific reference ranges for thyrotropin and free thyroxine in the north Denmark region pregnancy cohort. Thyroid. 2019;29(3):430-438.
39. van den Boogaard E, Vissenberg R, Land JA, et al. Significance of (sub)clinical thyroid dysfunction and thyroid autoimmunity before conception and in early pregnancy: a systematic review. Hum Reprod Update. 2011;17(5):605-619.
40. HAPO Study Cooperative Research Group; Metzger BE, Lowe LP, et al. Hyperglycemia and adverse pregnancy outcomes. N Engl J Med. 2008;358(19):1991-2002.
41. Xu Y, Derakhshan A, Hysaj O, et al. The optimal healthy ranges of thyroid function defined by the risk of cardiovascular disease and mortality: systematic review and individual participant data meta-analysis. Lancet Diabetes Endocrinol. 2023;11(10):743-754.
42. Van Uytfanghe K, Ehrenkranz J, Halsall D, et al. Thyroid stimulating hormone and thyroid hormones (triiodothyronine and thyroxine): an American thyroid association-commissioned review of current clinical and laboratory status. Thyroid. 2023;33(9):1013-1028.
