Accuracy and Efficiency of Deep-Learning-Based Automation of Dual Stain Cytology in Cervical Cancer Screening
doi: 10.1093/jnci/djaa066
First published online June 25, 2020
Article
Philip E. Castle, PhD,8 Mark Schiffman, MD,1 Thomas S. Lorey, MD,4 Niels Grabe, PhD2,5,6
Affiliations of authors: 1Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD, USA; 2Steinbeis Transfer Center for Medical Systems Biology, Heidelberg, Germany; 3Global Coalition Against Cervical Cancer, Arlington, VA, USA; 4Kaiser Permanente TPMG Regional Laboratory, Berkeley, CA, USA; 5Hamamatsu Tissue Imaging and Analysis Center (TIGA), BIOQUANT, University Heidelberg, Heidelberg, Germany; 6National Center of Tumor Diseases, Medical Oncology, University Hospital Heidelberg, Heidelberg, Germany; 7University of Oklahoma, Oklahoma City, OK, USA; and 8Albert Einstein College of Medicine, Bronx, NY, USA
*Correspondence to: Nicolas Wentzensen, MD, PhD, MS, Division of Cancer Epidemiology and Genetics, National Cancer Institute, 9609 Medical Center Drive, Room 6E-448, Bethesda, MD 20892-9774, USA (e-mail: wentzenn@mail.nih.gov).
Abstract
Background: With the advent of primary human papillomavirus testing followed by cytology for cervical cancer screening,
visual interpretation of cytology slides remains the last subjective analysis step and suffers from low sensitivity and
reproducibility. Methods: We developed a cloud-based whole-slide imaging platform with a deep-learning classifier for p16/
Ki-67 dual-stained (DS) slides trained on biopsy-based gold standards. We compared it with conventional Pap and manual DS
in 3 epidemiological studies of cervical and anal precancers from Kaiser Permanente Northern California and the University
of Oklahoma comprising 4253 patients. All statistical tests were 2-sided. Results: In independent validation at Kaiser
Permanente Northern California, artificial intelligence (AI)-based DS had lower positivity than cytology (P < .001) and manual
DS (P < .001) with equal sensitivity and substantially higher specificity compared with both Pap (P < .001) and manual DS
(P < .001), respectively. Compared with Pap, AI-based DS reduced referral to colposcopy by one-third (41.9% vs 60.1%, P < .001).
At a higher cutoff, AI-based DS had similar performance to high-grade squamous intraepithelial lesions cytology, indicating a
risk high enough to allow for immediate treatment. The classifier was robust, showing comparable performance in 2 cytology
systems and in anal cytology. Conclusions: Automated DS evaluation removes the remaining subjective component from
cervical cancer screening and delivers consistent quality for providers and patients. Moving from Pap to automated DS
substantially reduces the number of colposcopies and also achieves excellent performance in a simulated fully vaccinated
population. Through cloud-based implementation, this approach is globally accessible. Our results demonstrate that AI not
only provides automation and objectivity but also delivers a substantial benefit for women by reduction of unnecessary
colposcopies.
Advances in digital imaging and machine learning can revolutionize cancer screening, diagnosis, and treatment by improving accuracy and reproducibility of image assessment and streamlining clinical workflow (1–4). With its requirement for high throughput and fast turnaround and its dependence on microscopic and visual technologies, automation can play a major role in improving the efficiency of cervical cancer screening. Many countries are currently switching from Pap cytology to high-risk human papillomavirus (HPV) screening (5–7). Although a negative HPV test provides great reassurance of low cervical cancer risk over the next decade (8–10), only a small subset of women with a positive HPV test require further evaluation. To avoid overburdening the system with HPV-positive women, additional triage is required for colposcopy referral (11, 12). Current triage strategies include partial HPV genotyping and Pap cytology (7, 13). The limited sensitivity and
Received: November 5, 2019; Revised: March 18, 2020; Accepted: April 30, 2020
© The Author(s) 2020. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
For commercial re-use, please contact journals.permissions@oup.com
N. Wentzensen et al. | 73
reproducibility of cytology require laborious quality control procedures and frequent retesting (14, 15). Improving the efficiency of cervical cancer screening is particularly important for vaccinated populations due to lower disease prevalence and higher demands for screening test performance.
A promising triage strategy is concomitant detection of p16 and Ki-67 in the same cell (p16/Ki-67 dual stain [DS]), 2 markers that are closely linked to cervical carcinogenesis and HPV oncoprotein actions. The HPV oncoprotein E7 interrupts cell cycle control by releasing E2F, activating p16 expression. The coexpression of p16 and Ki-67, a cell proliferation marker, in the same cell is specific to HPV-related carcinogenesis. DS has shown greater accuracy for detection of HPV-related precancers compared with cytology (16–21). Currently, artificial intelligence (AI) algorithms mostly try to match manual reading accuracy to
of epithelial cells. For training of the deep-learning algorithms, tiles from training slides were manually evaluated for DS-positive cells by 3 observers (Supplementary Figure 1, available online).
Deep Learning
Two deep-learning approaches (convolutional neural network with 4 layers [CNN4] and Inception-v3 with 48 layers [IncV3]) were developed sequentially as shown in Figure 1 and described in Supplementary Methods (available online). The algorithms determine the number of DS-positive cells on a slide by detecting the number of tiles above a certain likelihood threshold. A slide is considered positive if the number of DS-positive cells on a slide exceeds a certain cutoff. Training and validation were conducted on the tile level and the slide level. First, a training set from 450 patients was selected for which the clinical endpoint cervical intraepithelial neoplasia grade 3 or greater (CIN3+) was unblinded. Tiles were selected for initial training (80%) and validation (20%) of the algorithm. The deep-learning network provides a likelihood for each tile; a tile is considered positive above a threshold (0.5 for CNN4 and 0.4 for IncV3). The resulting candidate CNN was applied on the slide level on training slides. A cutoff of positive tiles is used to determine slide positivity (3 tiles per cell for CNN4 and 2 tiles per cell for IncV3). From misclassified slides, false-positive or false-negative tiles were extracted and fed back into the original CNN training to optimize classification accuracy of the CNN. A final locked CNN was applied on the patient level on the blinded validation set comprising 3803 slides. CNN4 showed good performance in ThinPrep slides but not in SurePath slides. Subsequently, a second algorithm (IncV3) was trained specifically for SurePath slides (Supplementary Methods, available online). We published a GitHub repository and created a web page at https://github.com/stcmedhub/dual_stain_dl with a source code description of the models and the installation instructions.
ing (16). From a screening population of more than 300 000 women in a year, 3333 slides from HPV-positive women were included. From 238 training slides, 8215 DS-positive and 9739 DS-negative tiles were used for training (Figure 1). The study was approved by the KPNC IRB and was exempted from institutional review at the NCI by the Office of Human Subjects Research. Patient consent was waived because deidentified discarded specimens were used in this study.
Clinical Endpoints
All studies followed routine clinical practice at the respective institutions. Cytology was classified by the Bethesda System: negative for intraepithelial lesions or malignancy, atypical squamous cells of undetermined significance, low-grade squamous intraepithelial lesions, and high-grade squamous intraepithelial lesions (HSIL) (26). Final diagnosis was established by histopathology classified according to the cervical intraepithelial neoplasia (CIN) scale for cervical endpoints, which indicates the extent of dysplastic cells in the cervical epithelium: no indication for biopsy, normal, CIN grade 1 (CIN1), grade 2 (CIN2), grade 3 (CIN3), and cancer. We grouped adenocarcinoma in situ
74 | JNCI J Natl Cancer Inst, 2021, Vol. 113, No. 1
[Figure 1 diagram: 409 patients, 299 patients, and 3,095 patients (3,803 patients in total); slide to tiles; number of positive tiles (% likelihood threshold); clinical performance (2 × 2 comparison of positive and negative results).]
Figure 1. Study design. AI = artificial intelligence; CNN = convolutional neural network; CIN3+ = cervical intraepithelial neoplasia grade 3 or worse; DS = dual stain.
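The tile-to-slide decision rule outlined in the Deep Learning section and in Figure 1 can be illustrated with a minimal sketch. This is not the authors' implementation (their source code is in the GitHub repository named in the text). The names SlideRule, count_positive_tiles, and classify_slide are hypothetical, and the sketch assumes that each tile above the likelihood threshold represents one DS-positive cell and that a slide is positive when the count reaches the stated cutoff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlideRule:
    """Hypothetical container for the two cutoffs quoted in the text."""
    tile_threshold: float  # a tile is DS-positive above this likelihood
    cell_cutoff: int       # a slide is positive at this many DS-positive cells or more


# Thresholds and cutoffs as quoted in the article; pairing them per model
# in one object is a choice made for this sketch.
CNN4_THINPREP = SlideRule(tile_threshold=0.5, cell_cutoff=3)
INCV3_SUREPATH = SlideRule(tile_threshold=0.4, cell_cutoff=2)


def count_positive_tiles(likelihoods, rule):
    """Count tiles whose predicted likelihood lies above the tile threshold."""
    return sum(1 for p in likelihoods if p > rule.tile_threshold)


def classify_slide(likelihoods, rule):
    """Return True if the slide is called DS-positive.

    Assumption: one positive tile stands for one DS-positive cell.
    """
    return count_positive_tiles(likelihoods, rule) >= rule.cell_cutoff


# Made-up tile likelihoods for two illustrative slides.
slide_a = [0.10, 0.45, 0.60, 0.70, 0.90]
slide_b = [0.45, 0.55]

print(classify_slide(slide_a, CNN4_THINPREP))   # three tiles above 0.5
print(classify_slide(slide_b, CNN4_THINPREP))   # one tile above 0.5
print(classify_slide(slide_b, INCV3_SUREPATH))  # two tiles above 0.4
```

Keeping the tile threshold and the cell cutoff as separate parameters mirrors the two-stage description in the text, where the tile-level likelihood and the slide-level cutoff are tuned at different steps.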
with CIN3. For anal disease endpoints, the comparable anal intraepithelial neoplasia (AIN) nomenclature was used.
p16/Ki-67 Staining and Evaluation
For the Biopsy Study and ACSS, slides were prepared from residual PreservCyt material using a T2000 processor (Hologic, Bedford, MA). For the KPNC study, slides were prepared from residual SurePath tubes according to the manufacturer’s instructions (BD, Sparks, MD). Immunostaining of cervical cytology slides for p16/Ki-67 was performed using the CINtec Plus Kit (Roche, Tucson, AZ) according to the manufacturer’s instructions. DS-trained cytotechnologists reviewed all slides; a slide was considered positive if 1 or more cervical epithelial cell(s) stained both with a brown cytoplasmic stain (p16) and a red nuclear stain (Ki-67) irrespective of morphologic abnormalities. Slides from the Biopsy Study and ACSS were stained and evaluated at Roche mtm laboratories AG, Heidelberg, Germany, whereas slides from the Kaiser DS study were stained and evaluated at KPNC. HPV testing with partial genotyping (HPV16 and HPV18) at KPNC was based on the cobas assay (Roche, Pleasanton, CA).
Statistical Analysis
We created boxplots and calculated medians to show the distribution of DS-positive cells in cytology and histology categories. We compared differences in DS cell counts in ordinal cytology and histology categories using 1-way analysis of variance. The primary endpoint for the Biopsy Study and the Kaiser Study was CIN3 or greater (CIN3+). For ACSS, the primary endpoint was AIN2 or AIN3 (AIN2+). Receiver operator characteristics curve analysis was conducted for the number of DS-positive cells against the primary endpoints, and the area under the curve (AUC) was calculated. Sensitivity and specificity coordinates for manual DS evaluation and cytology were plotted on the receiver operator characteristics curve for comparison. We calculated percentage positivity, sensitivity, specificity, and Youden’s index in the Biopsy Study and ACSS for the cutoff determined by CNN4 and for manual DS evaluation. In the Kaiser Study, with a representative population of women who underwent routine screening, we calculated percentage positivity, sensitivity, specificity, and positive and negative predictive values for automated and manual DS. Differences in positivity, sensitivity, and specificity were evaluated using an exact McNemar’s χ² test, and differences in predictive values were evaluated using the R package DTComPair, using the generalized score statistic (27). To evaluate clinical efficiency of each strategy, we estimated the number of CIN3+ detected for different cutoffs of DS-positive cells and the ratio of the number of tests and colposcopies per case of CIN3+ detected. We also evaluated the theoretical performance of automated DS in a fully vaccinated population by excluding all women who were positive for HPV16 and/or HPV18 from the analysis. Analyses were performed in SPSS, Stata, and R. All statistical tests were 2-sided and P less than .05 was considered statistically significant.
Results
Automated Detection of DS-Positive Cells in Colposcopy and Anoscopy Populations
We developed a deep-learning algorithm for automated detection of DS-positive cells on ThinPrep slides from 2 referral populations (Biopsy Study and ACSS), including 212 training slides with 1186 DS-positive and 1485 DS-negative tiles (Figure 1). We evaluated the algorithm in independent validation slides from
Table 1. Accuracy for cervical and anal precancer based on manual and automated detection of DS-positive cells on ThinPrep slides in a colposcopy population (Biopsy Study, N = 409) and an anoscopy population (ACSS, N = 299)
aTwo-sided McNemar’s test. ACSS = Anal Cancer Screening Study; AIN2+ = anal intraepithelial neoplasia grade 2 or worse; AUC = area under the curve; CIN3+ = cervical intraepithelial neoplasia grade 3 or worse; DS = dual stain; CNN = convolutional neural network; Ref = referent.
bCNN4 cutoff for Biopsy: 3 or more cells; CNN4 cutoff for Anal: 3 or more cells.
both studies (Figure 1). In both studies, we observed an increase in the number of DS-positive cells with increasing severity of cytology and histology, with higher absolute DS-positive cell numbers in ACSS (P < .001 for all comparisons; Supplementary Figure 2, available online).
In the Biopsy Study validation set with 53 CIN3+, the AUC for detecting CIN3+ based on automated DS using CNN4 was 0.74 (Figure 2). At a cutoff of 3 DS-positive cells, the CNN4 algorithm had marginally lower positivity (58% vs 63%, respectively, P = .06) with comparable sensitivity (P = 1.0) and marginally higher specificity compared with manual DS (40.6% vs 45.7%, respectively, P = .07) (Table 1).
In the ACSS validation set with 69 AIN2+, the AUC for detecting AIN2+ based on automated evaluation of DS slides with CNN4 was 0.77 (Figure 2). At a cutoff of 3 DS-positive cells, the positivity of the CNN4 algorithm was lower (63% vs 71%, respectively, P = .001) with comparable sensitivity (P = 1.0) and higher specificity compared with manual DS (36.1% vs 46.1%, respectively, P = .001) (Table 1).
Automated Detection of DS-Positive Cells in an HPV Screening Population
We developed the deep-learning algorithm for SurePath slides using a training set of 238 slides from the Kaiser study with 8215 DS-positive and 9739 DS-negative tiles and applied it in an independent validation set of slides from 3095 women. We observed an increase of DS-positive cells with increasing severity of cytology and histology (Supplementary Figure 3, available online).
In the Kaiser validation study including 218 CIN3+, the AUC for detecting CIN3+ based on automated evaluation of DS slides was 0.82 (Figure 3). At a cutoff of 2 cells, the positivity of the algorithm was statistically significantly lower (42% vs 50%, respectively, P < .001) with equal sensitivity (P = .4) but statistically significantly higher specificity (61.5% vs 52.6%, respectively, P < .001) compared with the manual DS. At a cutoff of 100 cells, accuracy approached that of HSIL cytology, which allows for immediate treatment according to current management guidelines (Figure 3). Automated DS provided better risk stratification compared with Pap cytology and manual DS (Figure 4): more women were reassured of a lower risk compared with the other strategies (58% for automated DS vs 50% for manual DS and 40% for cytology), and risk among positives was higher.
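The comparisons reported above rest on per-test accuracy measures and on a paired exact McNemar test applied to two readings of the same slides. The following is a minimal sketch of both calculations using only the Python standard library. The function names and all counts are hypothetical illustrations, not study data, and the exact test is written here as a two-sided binomial test on the discordant pairs, which is one common formulation.

```python
from math import comb


def accuracy_metrics(tp, fp, fn, tn):
    """Positivity, sensitivity, specificity, and Youden's index from a 2 x 2 table."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "positivity": (tp + fp) / (tp + fp + fn + tn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden": sensitivity + specificity - 1,
    }


def exact_mcnemar(b, c):
    """Two-sided exact McNemar P value.

    b and c are the discordant pair counts (test A positive and test B
    negative, and the reverse). Under the null hypothesis the smaller
    count follows a Binomial(b + c, 0.5) distribution.
    """
    n = b + c
    if n == 0:
        return 1.0
    lower_tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * lower_tail)


# Hypothetical counts for illustration only.
metrics = accuracy_metrics(tp=80, fp=120, fn=20, tn=280)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")

# Hypothetical discordant pairs among disease-free women:
# 2 slides positive only by test A, 8 positive only by test B.
print(f"exact McNemar P = {exact_mcnemar(2, 8):.4f}")
```

Because both tests are read on the same women, only the discordant pairs carry information about the difference, which is why the paired test is used rather than a comparison of two independent proportions.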
Figure 4. Absolute risk of precancer for Pap cytology, manual dual stain (DS), and automated DS. ASCUS+ = positive for atypical squamous cells of undetermined significance or greater cytology results. The dotted lines show clinical action risk thresholds for colposcopy referral (4% risk) and immediate treatment (50% risk).
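The risk stratification idea behind Figure 4 can be sketched as follows: compute the absolute risk of precancer among test-positive and test-negative women, then compare each risk with the two action thresholds named in the caption (4% for colposcopy referral, 50% for immediate treatment). The function names, the counts, the action labels, and the choice to treat a risk exactly at a threshold as meeting it are assumptions of this sketch.

```python
COLPOSCOPY_THRESHOLD = 0.04  # 4% risk, from the Figure 4 caption
TREATMENT_THRESHOLD = 0.50   # 50% risk, from the Figure 4 caption


def risk_strata(tp, fp, fn, tn):
    """Absolute risk of disease among test positives and test negatives."""
    return {
        "risk_if_positive": tp / (tp + fp),  # equals the positive predictive value
        "risk_if_negative": fn / (fn + tn),  # equals 1 minus the negative predictive value
    }


def clinical_action(risk):
    """Map an absolute risk to the action whose threshold it meets."""
    if risk >= TREATMENT_THRESHOLD:
        return "immediate treatment"
    if risk >= COLPOSCOPY_THRESHOLD:
        return "colposcopy referral"
    return "below colposcopy threshold"


# Hypothetical counts for illustration only.
strata = risk_strata(tp=60, fp=340, fn=10, tn=590)
for label, risk in strata.items():
    print(f"{label}: {risk:.1%} -> {clinical_action(risk)}")
```

A test stratifies risk well when the risk among positives lies clearly above a threshold and the risk among negatives lies clearly below it, which is the pattern the text describes for automated DS.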
Table 2. Accuracy for cervical precancer based on Pap cytology and manual and automated detection of DS-positive cells on SurePath slides in the Kaiser Validation Study (N = 3095)
[Table body not recoverable from the extracted text. Surviving header fragments: Sensitivity, % (95% CI); Automated DS; /manual DS). Surviving No. (%) cells: 1860 (60.1); 1536 (49.6); 1298 (41.9); 1741 (56.3). Surviving P value cells: .02/Ref; .03/.8; .01/.5; Ref; .002/<.001; .2/Ref.]
Two-sided McNemar’s test. CIN3+ = cervical intraepithelial neoplasia grade 3 or worse; DS = dual stain; NPV = negative predictive value; PPV = positive predictive value; Ref = referent.
Discussion
cervical cancer screening by substantially reducing unnecessary colposcopies compared with current standards and similarly achieves excellent performance in a simulated fully vaccinated
supplant and improve the role of Pap cytology for triage of HPV-positive women and should also be evaluated for postcolposcopy and posttreatment surveillance (16). Compared with Pap cytology, manual DS has higher accuracy and can provide lon-
cases are less likely to progress (16). Automated DS evaluation
14. Wright TC Jr, Stoler MH, Behrens CM, et al. Interlaboratory variation in the performance of liquid-based cytology: insights from the ATHENA trial. Int J Cancer. 2014;134(8):1835–1843.
15. Stoler MH, Schiffman M. Interobserver reproducibility of cervical cytologic and histologic interpretations: realistic estimates from the ASCUS-LSIL Triage Study. JAMA. 2001;285(11):1500–1505.
16. Wentzensen N, Clarke MA, Bremer R, et al. Clinical evaluation of HPV screening with p16/Ki-67 dual stain triage in a large organized cervical cancer screening program. JAMA Intern Med. 2019;179(7):881.
17. Clarke MA, Cheung LC, Castle PE, et al. Five-year risk of cervical precancer following p16/Ki-67 dual-stain triage of HPV-positive women. JAMA Oncol. 2019;5(2):181.
18. Wentzensen N, Fetterman B, Castle PE, et al. p16/Ki-67 dual stain cytology for detection of cervical precancer in HPV-positive women. J Natl Cancer Inst. 2015;107(12):djv257.
19. Wentzensen N, Schwartz L, Zuna RE, et al. Performance of p16/Ki-67 immunostaining to detect cervical cancer precursors in a colposcopy referral population. Clin Cancer Res. 2012;18(15):4154–4162.
20. Wright TC Jr, Behrens CM, Ranger-Moore J, et al. Triaging HPV-positive
23. Grabe N, Lahrmann B, Pommerencke T, et al. A virtual microscopy system to scan, evaluate and archive biomarker enhanced cervical cytology slides. Cell Oncol. 2010;32(1-2):109–119.
24. Wentzensen N, Walker JL, Gold MA, et al. Multiple biopsies and detection of cervical cancer precursors at colposcopy. J Clin Oncol. 2015;33(1):83–89.
25. Wentzensen N, Follansbee S, Borgonovo S, et al. Human papillomavirus genotyping, human papillomavirus mRNA expression, and p16/Ki-67 cytology to detect anal cancer precursors in HIV-infected MSM. AIDS. 2012;26(17):2185–2192.
26. Solomon D, Davey D, Kurman R, et al. The 2001 Bethesda system: terminology for reporting results of cervical cytology. JAMA. 2002;287(16):2114–2119.
27. Leisenring W, Alonzo T, Pepe MS. Comparisons of predictive values of binary medical diagnostic tests for paired designs. Biometrics. 2000;56(2):345–351.
28. Schiffman M, Doorbar J, Wentzensen N, et al. Carcinogenic human papillomavirus infection. Nat Rev Dis Primers. 2016;2(1):16086.
29. Franco EL, Cuzick J. Cervical cancer screening following prophylactic human papillomavirus vaccination. Vaccine. 2008;26(Suppl 1):A16–A23.
30. Massad LS, Einstein MH, Huh WK, et al. 2012 ASCCP Consensus Guidelines