Abstract
Chest X-rays are the most commonly performed medical imaging exam, yet they are often misinterpreted by physicians. Here, we present an FDA-cleared artificial intelligence (AI) system that uses a deep learning algorithm to assist physicians in the comprehensive detection and localization of abnormalities on chest X-rays. We trained and tested the AI system on a large dataset, assessed generalizability on publicly available data, and evaluated radiologist and non-radiologist physician accuracy when unaided and aided by the AI system. The AI system accurately detected chest X-ray abnormalities (AUC: 0.976, 95% bootstrap CI: 0.975, 0.976) and generalized to a publicly available dataset (AUC: 0.975, 95% bootstrap CI: 0.971, 0.978). Physicians showed significant improvements in detecting abnormalities on chest X-rays when aided by the AI system compared to when unaided (difference in AUC: 0.101, p < 0.001). Non-radiologist physicians detected abnormalities on chest X-ray exams as accurately as radiologists when aided by the AI system and were faster at evaluating chest X-rays when aided compared to unaided. Together, these results show that the AI system is accurate and reduces physician errors in chest X-ray evaluation, which highlights the potential of AI systems to improve access to fast, high-quality radiograph interpretation.
Introduction
Chest X-rays are the most commonly performed medical imaging exam, and accurate and timely interpretation of chest X-rays is critical for providing high-quality patient care1. However, it has been reported that physicians miss findings in approximately 30% of abnormal exams2,3, which may lead to delayed treatments, unnecessary costs, malpractice lawsuits, and preventable morbidity1,4,5,6,7,8,9,10,11. Physicians, regardless of specialty12, can make interpretational errors on radiographs, and misinterpretations can arise from many factors including, but not limited to, the number of findings on a radiograph, the physician's experience and diagnostic skill, fatigue, and cognitive or attentional constraints3,13,14,15,16,17,18,19,20,21. Further, the substantial growth in radiologists' workload and the need for treating physicians to make time-sensitive decisions mean that non-radiologist physicians are often tasked with interpreting medical images despite lacking extensive radiology education and training16,22,23,24,25. Physicians without extensive radiology training are less accurate at interpreting chest X-rays26,27,28 and pose an even larger risk of misdiagnosis29,30.
Artificial intelligence (AI) systems using a subclass of AI called deep learning can accurately interpret and classify abnormal findings on chest X-rays31,32. These AI models are often highly successful, with performance matching, or even exceeding, that of radiologists33,34,35,36,37,38,39,40. However, many of these models have shortcomings that limit the scope of their clinical applications. Some models only detect select findings on chest X-rays37,38,39,40,41,42,43,44,45,46,47,48,49,50,51 and others have large variations in the amount and quality of data used during training and testing38. Additionally, the majority of clinical AI models have not been assessed and cleared through FDA's rigorous review process52. Finally, most AI models have only been shown to improve radiologists' accuracy41,53, but it is important to understand how other physician specialties may benefit from these AI systems because non-radiologist physicians often interpret chest X-rays22. For an AI model to be most beneficial in reducing errors in chest X-ray interpretation, it should detect multiple abnormalities, generalize to new patient populations, and be cleared by FDA to enable clinical adoption across physician specialties54,55.
In the current work, we describe our development of an FDA-cleared, computer-assisted detection (CAD) system (device name: Chest-CAD, 510(k) number: K210666), which uses deep learning to assist physicians during their interpretation of chest X-rays. The AI system56 identifies suspicious regions of interest (ROIs) and assigns each ROI to one of eight clinical categories consistent with the reporting guidelines from the Radiological Society of North America (Cardiac, Mediastinum/Hila, Lungs, Pleura, Bones, Soft Tissues, Hardware, or Other)57. The eight categories exhaustively encompass any suspicious ROIs present in a chest X-ray, and the AI system produces boxes around the ROIs. Figure 1 shows an example of the AI system output, in which the AI system identified three ROIs and classified them into two categories, Lungs and Mediastinum/Hila. See the Supplementary Information for more details about how the AI algorithm was trained for each category.
To test the efficacy of the AI system, we first assessed performance overall and per category in the absence of any interaction with a physician (referred to as standalone testing) on a dataset of 20,000 chest X-ray cases from 12 healthcare centers in the U.S., which was independent of the dataset used to train the AI system. We also tested the generalizability of our findings on a publicly available chest X-ray dataset. Next, we evaluated radiologist and non-radiologist physician performance with and without the assistance of the AI system (referred to as clinical testing). In this study, we aim to determine (1) the accuracy of the AI system, trained on a large, diverse dataset (see Table 1 for dataset characteristics), in identifying and categorizing abnormalities on chest X-rays, (2) the consistency of the AI system's standalone performance across multiple datasets, and (3) the performance of radiologist and non-radiologist physicians in detecting and categorizing abnormalities on chest X-rays when aided by the AI system compared to when unaided.
Results
Evaluation of the AI system
We first evaluated the standalone performance of the AI system to determine its accuracy in detecting chest X-ray abnormalities relative to a reference standard defined by a panel of expert radiologists. Details about the reference standard are described in Methods and in Table 1. Standalone performance was assessed on 20,000 adult cases that met the inclusion criteria for the AI system's indications for use (patient age ≥ 22 years and cases with 1 or 2 chest X-ray images). Sampling procedures for the standalone testing dataset are described in the Methods. The AI system had a high level of agreement with expert radiologists and also generalized across patient and image characteristics. The AI system demonstrated high overall AUC (0.976, 95% bootstrap CI: 0.975, 0.976), sensitivity (0.908, 95% bootstrap CI: 0.905, 0.911), and specificity (0.887, 95% bootstrap CI: 0.885, 0.889) in identifying chest abnormalities in the standalone testing dataset. Figure 2 shows the number of positive cases in each of the eight categories, along with high sensitivity, specificity, and AUCs per category. The AI system's performance was also evaluated across patient and image characteristics in the standalone testing dataset. Supplementary Table 1 shows the AUC by patient sex, patient age group, number of images per case, brightness level, contrast level, resolution level, and X-ray device manufacturer. All AUCs were over 0.950, indicating that the AI system had similarly high performance across subgroups. The accuracy of the ROI location was also compared between the reference standard ROI and the AI system's ROI by measuring the intersection-over-union (IoU) per category, which showed a high degree of overlap58 (see Supplementary Table 2).
Performance of the AI system on the NIH ChestX-ray8 dataset
To test the accuracy of the AI system on an independent chest X-ray dataset, standalone performance was also assessed on a subset of the publicly available National Institutes of Health (NIH) data (ChestX-ray8)59. One thousand cases were randomly selected, and analyses were conducted on the 922 cases that met the inclusion criteria for the AI system's indications for use. Details about the reference standard labels are in Table 1 and are described in Methods. The AI system demonstrated high overall AUC (0.975, 95% bootstrap CI: 0.971, 0.978), sensitivity (0.907, 95% bootstrap CI: 0.894, 0.919), and specificity (0.887, 95% bootstrap CI: 0.878, 0.896) in identifying chest abnormalities in the subset of data from the ChestX-ray8 dataset. The AI system also demonstrated high AUCs for each category, consistent with the AUCs in the standalone testing described above, suggesting that the performance of the AI system generalizes to distinct datasets.
Evaluation of physician performance
We tested physicians' accuracy in detecting abnormalities on chest X-rays when unaided and aided by the AI system through a multi-reader, multi-case (MRMC) study described in Methods. The AI system significantly improved physicians' accuracy in detecting chest abnormalities across all categories (p < 0.001; unaided AUC: 0.773; aided AUC: 0.874; difference in least squares mean AUC = 0.101, 95% CI: 0.101, 0.102). Figure 3a shows the receiver operating characteristic (ROC) curves when physicians were unaided and aided by the AI system. Physician sensitivity increased from 0.757 (95% CI: 0.750, 0.764) when unaided to 0.856 (95% CI: 0.850, 0.862) when aided, demonstrating a relative reduction in missed abnormalities of 40.74%. Physician specificity increased from 0.843 (95% CI: 0.839, 0.847) when unaided to 0.870 (95% CI: 0.866, 0.873) when aided, showing that the use of the AI system did not result in physicians overcalling abnormalities, but instead assisted in correctly identifying cases with no suspicious ROIs. Further, AUC values for every individual physician increased when aided by the AI system compared to when unaided (Fig. 3b; see Supplementary Table 5 for the 24 physicians' unaided and aided performance).
As expected, radiologists were more accurate than non-radiologist physicians (emergency medicine, family medicine, and internal medicine physicians) at identifying abnormalities in chest X-rays when unaided by the AI system (Fig. 4). Despite high unaided accuracy, radiologists still showed an improvement when assisted by the AI system, with an unaided AUC of 0.865 (95% CI: 0.858, 0.872) and an aided AUC of 0.900 (95% CI: 0.895, 0.906). Internal medicine physicians showed the largest improvement when assisted by the AI system, with an unaided AUC of 0.800 (95% CI: 0.793, 0.808) and an aided AUC of 0.895 (95% CI: 0.889, 0.900; see Fig. 4). There was a significant difference in AUC between radiologists and non-radiologist physicians in the unaided condition (p < 0.001), but no significant difference in the aided condition (p = 0.092), suggesting that the AI system helps non-radiologist physicians detect abnormalities in chest X-rays with similar accuracy to radiologists. Radiologists and non-radiologist physicians also experienced relative reductions in missed abnormalities of 29.74% and 44.53%, respectively. The AUC values overall and per category for radiologists and non-radiologist physicians when unaided and when aided are reported in Supplementary Table 3.
In addition to improving detection accuracy, assistance by the AI system reduced physicians' case read times. Non-radiologist physicians detected abnormalities in chest X-ray cases significantly faster when aided by the AI system versus when unaided, with an average improvement of 10 s (t(1,17) = 2.281, p = 0.036 (uncorrected), 7.94% faster; aided median read time = 98.5 s, unaided median read time = 107 s). There was no difference in radiologist read times when aided by the AI system versus when unaided (t(1,5) = 0.267, p = 0.800; aided median read time = 67.5 s, unaided median read time = 69.5 s; see Supplementary Table 4).
Discussion
The FDA-cleared AI system demonstrated strong standalone performance, detecting chest abnormalities on X-rays on par with radiologists, and the AI system's performance generalized to a separate, publicly available chest X-ray dataset. In the clinical testing, we showed that overall physician accuracy improved when aided by the AI system, and non-radiologist physicians were as accurate as radiologists in evaluating chest X-rays when aided by the AI system. Taken together, our findings show that the AI system supports different physician specialties in accurate chest X-ray interpretation.
The AI system was trained using deep learning and performed with high accuracy in the standalone testing dataset, with an overall AUC of 0.976. Accuracy remained high across the eight categories of the AI system (all AUCs > 0.92) and across different patient and image characteristics (all AUCs > 0.95). Additionally, we demonstrated the generalizability of the AI system by evaluating performance on the NIH's publicly available chest X-ray dataset. The system generalized well, with similarly high performance on both datasets (NIH dataset AUC: 0.975; standalone testing dataset AUC: 0.976). The high performance of the AI system can be partly attributed to the large and diverse training dataset and the quality of the labels. We manually created over six million labels from nearly 500,000 radiographs. Consistent with our previous work, the labels came from expert U.S. board-certified radiologists to ensure high clinical accuracy and generalizability of the machine learning algorithms60,61. Prior work on AI systems for radiography has relied on labels extracted retrospectively from radiology reports using NLP techniques36,62,63,64, which can result in inaccuracies65,66,67. As such, there is a growing consensus that high-quality clinical labels are essential to high-performing AI systems68,69,70.
The AI system has the potential to positively impact patients and the healthcare system, as demonstrated by physicians' consistently higher accuracy in detecting abnormalities on chest X-rays when assisted by the AI system. Every physician in the study improved when aided compared to unaided, suggesting that each physician, even those highly skilled at chest X-ray interpretation, benefited from using the AI system. There was no difference in accuracy between radiologists and non-radiologist physicians when aided by the AI system. Further, non-radiologist physicians were faster at detecting abnormalities on chest X-rays when aided compared to when unaided. This suggests that the AI system increased non-radiologist physicians' accuracy and efficiency, such that they performed more similarly to radiologists. Improving the accuracy of chest X-ray interpretation with the AI system has the potential to reduce misdiagnosis, leading to improved quality of care and decreased costs, and should be directly measured in future prospective trials.
Improving non-radiologist physicians' ability to interpret chest X-rays is critical for improving patient outcomes. Non-radiologist physicians are often the first to evaluate patients in many care settings and routinely interpret chest X-rays when a radiologist is not readily available27. Additionally, the share of practicing radiologists relative to other physician specialties is declining in the U.S.71, particularly in rural counties71,72,73. As a result, practicing radiologists are overburdened, and even large institutions, such as the Veterans Affairs hospitals, have backlogs of hundreds of thousands of unread X-ray cases74. Therefore, non-radiologist physicians are being tasked with interpreting chest X-rays, despite less training and poorer accuracy22,23,24,25. The majority of prior studies have conducted clinical testing of AI systems with radiologists only48,50,53. Here, we provide evidence that the AI system aids non-radiologist physicians, which can lead to greater access to high-quality medical imaging interpretation and may reduce misdiagnoses on chest X-rays when radiologists are unavailable.
There were limitations in this study. First, in the clinical study, physicians were not given access to additional patient information for the chest X-rays; this was an intentional study design decision to keep the information provided to the physician and the AI system the same. Second, future prospective studies will be necessary to measure the real-world impact of the AI system on directly reducing physicians' misdiagnoses and subsequently improving patient outcomes. The AI system is well positioned to be implemented and integrated into clinical settings due to its FDA clearance56 and category outputs that align with standard radiology reporting guidelines57. Third, a subset of positive cases was labeled for the specific abnormality detected by the expert radiologists, and the performance of the AI system was evaluated in a more granular manner; however, future analyses with a larger sample size are required to investigate performance for lower-prevalence conditions (see Supplementary Table 6). Finally, while the AI system's categories were designed to be mutually exclusive, it is possible that the expert radiologists interpreted the same abnormality on an X-ray differently and selected two different categories as abnormal when providing ground truth labels. Since the reference standard is determined by the majority opinion, this situation could result in no suspicious ROI being found, which may cause the reported sensitivity values to be higher than the true performance. To minimize any confusion about the definitions for each category, the expert radiologists received extensive training on the types of abnormalities associated with each category before providing labels (see Supplementary Information, AI System Categories).
Overall, the FDA-cleared AI system, trained using deep learning, demonstrated strong standalone performance and increased physician accuracy in detecting abnormalities on chest X-rays. It eliminated the gap in accuracy between radiologists and non-radiologist physicians when detecting abnormalities on chest X-rays and enabled non-radiologist physicians to read cases more efficiently. Thus, the AI system has the potential to increase timely access to high-quality care across the U.S. healthcare system and improve patient outcomes.
Methods
AI system development
To build the AI system, 17 U.S. board-certified radiologists with a median of 14 years of experience manually annotated a development dataset of 341,355 chest X-ray cases, generating a total of 6,202,776 labels. The development dataset consisted of 492,996 de-identified radiographs from 185,114 patients collected from 15 hospitals, outpatient care centers, and specialty centers in the United States. The development dataset was then randomly split into a training set (326,493 cases; 471,358 radiographs) and a tuning set (14,862 cases; 21,638 radiographs). See Table 1 for details about the labels, as well as the patient and image characteristics of the development dataset.
To ensure that the radiologists produced high-quality labels for the AI system, each radiologist completed rigorous, multi-step training programs. The expert radiologists were responsible for jointly localizing abnormalities on all radiographs within a case (the localized area was considered the "reference standard" area) and producing case-level judgments as to the presence or absence of abnormalities for each of the eight categories. Each case was interpreted by one to six radiologists. Categories were based on standard radiology reporting guidelines as defined by the Radiological Society of North America57. The expert radiologists were instructed to look for specific abnormalities when labeling a suspicious ROI in each category. For example, in the Cardiac category, the abnormalities the expert radiologists were instructed to look for included cardiomegaly, suspected pericardial effusion, chamber enlargement, valvular calcification, dextrocardia, dextroposition, constrictive pericarditis, coronary artery calcifications, suspected pericardial cyst, and obscured cardiac silhouette. The AI system categories are explained in more depth in the Supplementary Information.
The AI system was trained using deep learning to recognize visual patterns that correspond to abnormalities within each category through a supervised training process. The system takes a chest X-ray case as input and produces a binary output indicating the presence or absence of an abnormality for each category. It also produces category-specific bounding box(es) when an abnormal region is identified. The binary outputs for each of the eight categories were defined to be mutually exclusive and collectively exhaustive, so that any clinically relevant abnormality detected by any of the expert radiologists was included in one of the categories.
Algorithm design
The AI systemâs processing of a chest X-ray case consisted of three stages: pre-processing, analysis, and post-processing.
Pre-processing stage
Input radiographs from a given chest X-ray case were automatically pre-processed to standardize their visual characteristics. Each radiograph, which was required to be a high-resolution input (over 1440 pixels), was first cropped to remove excess black padding around the edges. Next, resizing operations that preserved the aspect ratio were applied to standardize the image resolution to a height of 800 pixels; the image was downscaled using bilinear interpolation with anti-aliasing filters enabled. If necessary, the resize operation added padding to the edges of the image to reach a target width of 1,024 pixels. The image was then cropped, if necessary, to the target width of 1,024 pixels.
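A minimal sketch of this pre-processing pipeline is shown below. It assumes grayscale radiographs stored as NumPy arrays; the function name and the background threshold are illustrative and not taken from the device implementation.

```python
import numpy as np
from skimage.transform import resize

TARGET_HEIGHT, TARGET_WIDTH = 800, 1024  # sizes stated in the text

def preprocess(radiograph: np.ndarray, pad_threshold: float = 0.0) -> np.ndarray:
    """Crop black padding, resize to a fixed height, then pad/crop to a fixed width.

    `radiograph` is assumed to be a 2-D grayscale array; `pad_threshold` is an
    illustrative cutoff for what counts as background.
    """
    # 1. Crop away rows/columns that contain only (near-)black padding.
    content = radiograph > pad_threshold
    rows, cols = np.any(content, axis=1), np.any(content, axis=0)
    r0, r1 = np.where(rows)[0][[0, -1]]
    c0, c1 = np.where(cols)[0][[0, -1]]
    cropped = radiograph[r0:r1 + 1, c0:c1 + 1]

    # 2. Resize to a height of 800 px, preserving aspect ratio
    #    (bilinear interpolation, order=1, with anti-aliasing).
    scale = TARGET_HEIGHT / cropped.shape[0]
    new_width = int(round(cropped.shape[1] * scale))
    resized = resize(cropped, (TARGET_HEIGHT, new_width),
                     order=1, anti_aliasing=True, preserve_range=True)

    # 3. Pad (or crop) symmetrically to a width of 1,024 px.
    if new_width < TARGET_WIDTH:
        pad = TARGET_WIDTH - new_width
        resized = np.pad(resized, ((0, 0), (pad // 2, pad - pad // 2)))
    elif new_width > TARGET_WIDTH:
        start = (new_width - TARGET_WIDTH) // 2
        resized = resized[:, start:start + TARGET_WIDTH]
    return resized.astype(np.float32)
```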
Analysis stage
The analysis stage took the pre-processed radiographs of a chest X-ray case and used a convolutional neural network to create two outputs per category. One was a Bayesian posterior probability distribution characterizing the model's belief about the presence of an abnormality within the category. The other was a pixel-wise probability map representing an estimate of where any such abnormality would be within the case. A high-level schematic of the architecture is shown in Fig. 5.
The model produced case-level decisions by jointly analyzing one or more input radiographs from a given case. First, each radiograph was run separately through an image encoder, producing one 512-dimensional feature vector per radiograph. Each 512-dimensional vector was then passed through a fully connected layer to create a 1024-dimensional image feature vector. Next, the image encodings were collapsed across the case's radiographs into a single 1024-dimensional case-level feature vector through max-pooling. The image encoder used for this architecture was ResNet-3475, which was chosen for parsimony from several options that all yielded similar results (e.g., DenseNet). It is a widely used backbone for many computer vision tasks and has a low computational cost compared to many other common architectures. Probability maps for localization were produced by the model through a convolutional layer applied to feature maps from the image encoder's penultimate layer.
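A PyTorch sketch of this case-level encoder is shown below, assuming a torchvision ResNet-34 backbone. Layer sizes follow the description above, but class and variable names are illustrative, and the localization head is attached here to the final convolutional block as an approximation of the "penultimate layer" described in the text.

```python
import torch
import torch.nn as nn
import torchvision

N_CATEGORIES = 8

class CaseEncoder(nn.Module):
    """Encode the radiographs of one case into a single case-level feature vector
    plus per-image, per-category probability maps (sketch, not the device code)."""

    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet34(weights=None)
        # Everything up to (and including) layer4: outputs [N, 512, H/32, W/32].
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.project = nn.Linear(512, 1024)                 # 512-d -> 1024-d image features
        self.localizer = nn.Conv2d(512, N_CATEGORIES, 1)     # per-category probability maps

    def forward(self, images: torch.Tensor):
        # images: [n_images, 1, H, W] for a single case; replicate the grayscale
        # channel to fit the 3-channel ResNet stem.
        x = images.repeat(1, 3, 1, 1)
        fmap = self.features(x)                              # [n_images, 512, h, w]
        prob_maps = torch.sigmoid(self.localizer(fmap))      # [n_images, 8, h, w]
        img_vecs = self.pool(fmap).flatten(1)                # [n_images, 512]
        img_feats = self.project(img_vecs)                   # [n_images, 1024]
        case_embedding = img_feats.max(dim=0).values         # max-pool across radiographs
        return case_embedding, prob_maps
```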
The case embedding was run through a fully connected layer to produce the output parameters âalphaâ and âbetaâ for each category, which parameterized a beta-binomial distribution. From a Bayesian perspective, this distribution represents the modelâs prediction for how many k expert radiologists out of a group of n expert radiologists are likely to identify an abnormality when asked to interpret the chest X-ray. This likelihood model accounts for the observed spread of the available k-of-n binomial labels, where, as previously described, the binomial labels arose by having multiple expert radiologists review many of the cases within the development dataset. At inference time, a point estimate defined by the mean of the distribution was used by the post-processing stage of the software to determine the presence of a potential abnormality.
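For reference, the beta-binomial likelihood and the point estimate described above take the standard textbook form (not reproduced from the paper's supplementary material):

$$
P(k \mid n, \alpha, \beta) \;=\; \binom{n}{k}\,\frac{B(k+\alpha,\; n-k+\beta)}{B(\alpha, \beta)},
\qquad
\hat{p} \;=\; \mathbb{E}\!\left[\tfrac{k}{n}\right] \;=\; \frac{\alpha}{\alpha+\beta},
$$

where $B(\cdot,\cdot)$ is the Beta function, $k$ is the number of readers out of $n$ who would flag the category as abnormal, and $\hat{p}$ is the mean point estimate that the post-processing stage thresholds.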
Post-processing stage
The system took the output from the model and applied post-processing operations to create two outputs per category: (1) a binary determination representing the systemâs prediction of whether any abnormalities for that category were present within the X-ray case and (2) a set of bounding boxes surrounding any such abnormalities.
The binary determination was calculated from the point estimates produced by the model using a category-specific threshold pre-computed on the tuning dataset. Any score lying on or above the threshold resulted in an abnormality-present determination, and any score below the threshold resulted in an abnormality-absent determination. The thresholds were optimized to yield equal sensitivity and specificity per category on the tuning set. The set of bounding boxes was created from the category's pixel-wise probability map output, using a heuristic that placed boxes around high-probability regions based on pre-computed, category-specific thresholds that binarized the probability map.
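A sketch of this post-processing step for one category is shown below; it uses connected-component labeling from SciPy to box high-probability regions, which is one reasonable reading of the heuristic described above, and the thresholds are placeholders for the pre-computed, category-specific values.

```python
import numpy as np
from scipy import ndimage

def postprocess_category(point_estimate: float,
                         prob_map: np.ndarray,
                         decision_threshold: float,
                         map_threshold: float):
    """Turn one category's point estimate and probability map into a binary
    determination plus bounding boxes (sketch; thresholds are tuned per category)."""
    # Abnormality-present if the score lies on or above the decision threshold.
    present = point_estimate >= decision_threshold

    boxes = []
    if present:
        # Binarize the pixel-wise probability map and box each connected
        # high-probability region.
        binary = prob_map >= map_threshold
        labeled, n_regions = ndimage.label(binary)
        for sl in ndimage.find_objects(labeled):
            if sl is None:
                continue
            rows, cols = sl
            boxes.append((cols.start, rows.start, cols.stop, rows.stop))  # x0, y0, x1, y1
    return present, boxes
```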
Model training
The model was trained by minimizing a joint loss function that assessed the model's ability to correctly predict the case-level classifications ("ROI present" or "ROI absent" for each output category) and its ability to correctly predict the location of the abnormality within individual radiographs. The joint loss function was defined as a weighted sum of two terms for each category, summed across categories. The first term was the across-radiograph average per-pixel binary cross-entropy loss between a given radiograph's predicted probability map and the ground truth map for that radiograph. The second term was the negative log-likelihood of the beta-binomial distribution, where the distribution's two free parameters (denoted as alpha and beta in Fig. 5) were outputs of the model. The binomial observations used in the beta-binomial negative log-likelihood correspond to k out of n labeling radiologists indicating the presence of an abnormality for the given case and category; note that n and k vary per case, as some cases were labeled by only a single expert radiologist and others were labeled by multiple independent expert radiologists. To increase the robustness of the model, data augmentation was used during training: radiographs were randomly rotated, vertically or horizontally flipped, gamma-corrected, contrast-adjusted, and cropped.
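A sketch of this joint loss in PyTorch follows. The tensor shapes and the per-term weights are assumptions for illustration; the beta-binomial negative log-likelihood uses the standard log-Gamma identity rather than any project-specific code.

```python
import torch
import torch.nn.functional as F

def beta_binomial_nll(alpha, beta, k, n):
    """Negative log-likelihood of observing k of n positive reads under a
    beta-binomial with parameters (alpha, beta). All inputs are float tensors
    of shape [n_categories]."""
    def log_beta(a, b):
        return torch.lgamma(a) + torch.lgamma(b) - torch.lgamma(a + b)
    log_binom = torch.lgamma(n + 1) - torch.lgamma(k + 1) - torch.lgamma(n - k + 1)
    log_lik = log_binom + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)
    return -log_lik

def joint_loss(prob_maps, target_maps, alpha, beta, k, n, w_map=1.0, w_cls=1.0):
    """Weighted sum, per category, of (1) the across-radiograph mean per-pixel BCE
    between predicted and reference localization maps and (2) the beta-binomial NLL
    of the k-of-n case-level labels. Weights w_map/w_cls are illustrative.

    prob_maps/target_maps: [n_images, n_categories, h, w]; alpha/beta/k/n: [n_categories].
    """
    bce = F.binary_cross_entropy(prob_maps, target_maps, reduction="none")
    bce_per_category = bce.mean(dim=(0, 2, 3))          # average over images and pixels
    nll_per_category = beta_binomial_nll(alpha, beta, k, n)
    return (w_map * bce_per_category + w_cls * nll_per_category).sum()
```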
The training algorithm repeatedly iterated through the training set in randomized batches of 64 cases. The parameters of the model were updated after processing each batch to minimize the aforementioned loss function. This minimization was achieved using AdamW76, a variant of the stochastic gradient descent algorithm. After each epoch, the model was evaluated on the tuning set, and training stopped early if the across-category mean AUC on the tuning set had not improved in the last 10 epochs.
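A skeleton of this optimization loop is sketched below. The helpers `model.compute_joint_loss` and `evaluate_mean_auc`, the learning rate, and the maximum epoch count are assumptions, not details reported in the paper.

```python
import torch

def train(model, train_loader, tuning_loader, evaluate_mean_auc,
          max_epochs=200, patience=10, lr=1e-4):
    """AdamW updates on batches of 64 cases, with early stopping once the
    tuning-set mean AUC has not improved for `patience` (10) epochs."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    best_auc, best_state, epochs_since_best = 0.0, None, 0

    for epoch in range(max_epochs):
        model.train()
        for batch in train_loader:                     # batches of 64 cases
            optimizer.zero_grad()
            loss = model.compute_joint_loss(batch)     # assumed wrapper around joint_loss()
            loss.backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            auc = evaluate_mean_auc(model, tuning_loader)  # across-category mean AUC
        if auc > best_auc:
            best_auc, epochs_since_best = auc, 0
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break                                   # no improvement in the last 10 epochs

    if best_state is not None:
        model.load_state_dict(best_state)
    return model
```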
After training finished, the chest X-ray cases in the tuning set were run through the trained model. Prediction scores on cases in the tuning set were used to compute the operating point for each category in the final model. The resulting decision thresholds (one per category) were then fixed and held constant prior to testing.
Standalone evaluation
Standalone testing dataset
The AI system's performance was evaluated on a standalone testing dataset consistent with FDA guidelines77. The standalone testing dataset consisted of 20,000 chest X-ray cases that were retrospectively randomly sampled from 12 hospitals and healthcare centers between July and December of 2017 to create a set of patient cases representative of the AI system's intended use (see Fig. 6). The 12 hospitals and healthcare centers were a subset of the 15 hospitals and healthcare centers used for the training dataset. A diligent segmentation process minimized patient overlap between the testing and training sets. There was an overlap of 0.33% of the patients in the training dataset with the patients in the standalone testing dataset; however, the impact of these overlapping patients on the study results was negligible, as performance remained unchanged when these patients were removed from the analysis (overall AUC remained 0.976). Reference standard labels for the standalone dataset were provided by 17 U.S. board-certified radiologists with a median of 14 years of experience post-residency. Each case was independently reviewed by three expert radiologists, and if any findings associated with a category (see Supplementary Information, AI System Categories) were identified, the expert radiologist marked a positive ROI for that category. The reference standard for each category was determined by the majority opinion (at least two of three) of the reviewing expert radiologists. During labeling, each expert radiologist was instructed to make a binary decision on the presence or absence of an abnormality for each of the categories, for each chest X-ray case they were assigned. Twenty thousand cases ensured a large natural positive incidence rate for each category, with at least 600 reference standard positive cases per category. See Table 1 for the number of positive cases per category, details about the labels used to create the reference standard, and the patient and image characteristics for the standalone testing dataset. See Fig. 6 for data sampling procedures for the testing datasets. There was no case overlap between the development dataset and the standalone testing dataset.
Performance of the AI system was evaluated using the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, per category and overall. The "overall" calculations were performed by aggregating the data from all categories and all cases through a technique called micro-averaging78. Sensitivity and specificity were also calculated per category and overall. The 95% confidence intervals were reported for each metric using bootstrap resampling (m = 1000).
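The sketch below illustrates one way to compute the micro-averaged AUC with a case-level bootstrap, assuming scikit-learn; the function name and the resampling details beyond m = 1000 are assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def micro_auc_with_ci(y_true, y_score, n_boot=1000, seed=0):
    """Micro-averaged AUC over all case-category pairs with a 95% bootstrap CI.

    y_true:  [n_cases, 8] binary reference-standard labels
    y_score: [n_cases, 8] model scores
    Cases (rows) are resampled with replacement n_boot times.
    """
    overall = roc_auc_score(y_true.ravel(), y_score.ravel())
    rng = np.random.default_rng(seed)
    n_cases = y_true.shape[0]
    boot = []
    for _ in range(n_boot):
        idx = rng.integers(0, n_cases, n_cases)
        yt, ys = y_true[idx], y_score[idx]
        if yt.min() == yt.max():          # skip degenerate resamples with one class only
            continue
        boot.append(roc_auc_score(yt.ravel(), ys.ravel()))
    lo, hi = np.percentile(boot, [2.5, 97.5])
    return overall, (lo, hi)
```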
To measure the localization accuracy of the AI system's ROIs, the intersection-over-union (IoU) was used to calculate the overlap between the predicted bounding box(es) and the reference standard bounding box(es) for each category. When multiple predicted bounding boxes were present for a category, their areas were summed to find the total predicted area; similarly, when multiple reference standard bounding boxes were found for a category, their areas were summed to find the total reference standard area. The IoU was determined by dividing the area of the intersection of the predicted area with the reference standard area by the area of their union. The reported aggregate IoU per category is the average IoU of the true positive cases.
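A mask-based sketch of this IoU computation follows. Rasterizing the boxes onto pixel masks is a simplification of the summed-area description above (it avoids double counting when boxes overlap); image dimensions and box format are assumptions.

```python
import numpy as np

def category_iou(pred_boxes, ref_boxes, height, width):
    """IoU between the union of predicted boxes and the union of reference-standard
    boxes for one category of one case. Boxes are (x0, y0, x1, y1) in pixels."""
    def to_mask(boxes):
        mask = np.zeros((height, width), dtype=bool)
        for x0, y0, x1, y1 in boxes:
            mask[y0:y1, x0:x1] = True
        return mask

    pred, ref = to_mask(pred_boxes), to_mask(ref_boxes)
    union = np.logical_or(pred, ref).sum()
    if union == 0:
        return 0.0
    return np.logical_and(pred, ref).sum() / union
```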
To measure the performance of the AI system in identifying category-specific ROIs when particular abnormalities are present on chest X-rays, labels were collected from U.S. board-certified radiologists on a subset of 400 to 800 positive cases per category, as determined by the reference standard. The abnormality labels were collected on a random subset of positive cases; therefore, not all abnormalities were represented in the investigation, and only those found on over 20 cases were analyzed. The number of cases with each type of abnormality identified can be found in Supplementary Table 6, along with the AI system's sensitivity in detecting the corresponding category-specific ROI.
NIH ChestX-ray8 dataset
The AI system's performance was evaluated on a subset of cases from the ChestX-ray8 dataset from the National Institutes of Health to confirm the robustness of the AI system. The complete ChestX-ray8 dataset is made up of over 100,000 de-identified chest X-rays obtained from over 30,000 patients59. We randomly sampled 1,000 cases, and 78 cases were removed because they did not meet the inclusion criteria for the AI system's indications for use. Thirteen U.S. board-certified radiologists with a median of 14 years of experience post-residency labeled each of the 922 cases for the presence or absence of suspicious ROIs for each of the eight categories. We established the reference standard using the majority opinion of labels randomly selected from three of the U.S. board-certified radiologists. See Table 1 for details about the labels, as well as the patient and image characteristics of the NIH dataset. The sensitivity, specificity, and AUC of the ROC curve were calculated overall and per category.
Physician evaluation
To evaluate whether physician accuracy improved when aided compared to unaided by the AI system, we conducted a multi-reader, multi-case (MRMC) study consistent with FDA guidelines79. Twenty-four physicians were enrolled to represent the intended use population. The physician specialties included radiologists (n = 6), internal medicine physicians (n = 6), family medicine physicians (n = 6), and emergency medicine physicians (n = 6). Physicians had a mean of 14.9 years of experience (range: 1–38 years).
Each physician read 238 X-ray cases (190 cases with at least one abnormality and 48 cases with no abnormalities) unaided and aided by the AI system. Physicians were not given access to additional patient information beyond the chest X-ray. The chest X-ray cases used in the clinical testing dataset were retrospectively randomly sampled from 12 hospitals and healthcare centers between July and December of 2017. Enrichment procedures ensured that there were at least 10 positive cases for each category in the clinical testing dataset. See Table 1 for details about the labels used to create the reference standard, as well as the patient and image characteristics for the clinical testing dataset. The reference standard labels were developed with the same methods outlined in the standalone testing dataset section above. See Fig. 6 for data sampling procedures for the clinical testing dataset. There was no case overlap between the development dataset, the standalone testing dataset, and the clinical testing dataset.
A power analysis was conducted in advance of the study, and it determined that using 24 physicians and 238 cases would provide over 90% power to detect a difference in aided versus unaided AUCs of 0.04. The study consisted of two independent reading sessions separated by a washout period of at least 28 days to reduce memory bias. Physicians read all cases twice. In the first session, half the cases were aided by the AI system and the other half were unaided. In the second session, all cases were read in the opposite condition. Cases were assigned using randomized stratification to reduce case order effects. Each physician was asked to determine the presence or absence of an abnormal region of interest for each of the eight categories for every case and provide a confidence score (0â100) for each of their judgments.
To determine whether there was a statistically significant difference between physician performance when aided versus unaided, we used the Dorfman, Berbaum, and Metz (DBM) model. The DBM model is one of the most common methods for estimating differences in the area under the curve (AUC) of the receiver operating characteristic (ROC) curve and for calculating the corresponding confidence intervals for MRMC studies80,81. We also calculated the sensitivity and specificity for each physician type, as well as 95% bootstrap confidence intervals. In addition, we calculated the relative reduction in miss rate by subtracting physicians' aided miss rate (1 - sensitivity) from their unaided miss rate and then dividing by their unaided miss rate.
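As a concrete check of this formula, the sketch below reproduces the relative reduction quoted in Results from the pooled sensitivities reported there; the function name is illustrative.

```python
def relative_miss_rate_reduction(sens_unaided: float, sens_aided: float) -> float:
    """Relative reduction in miss rate, where miss rate = 1 - sensitivity."""
    miss_unaided = 1.0 - sens_unaided
    miss_aided = 1.0 - sens_aided
    return (miss_unaided - miss_aided) / miss_unaided

# Pooled sensitivities from Results (0.757 unaided, 0.856 aided):
# (0.243 - 0.144) / 0.243 ≈ 0.407, i.e. the ~40.74% relative reduction reported above.
print(relative_miss_rate_reduction(0.757, 0.856))
```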
To examine the impact of the AI system on physicians who specialize in medical image interpretation versus those who do not, we separated the physicians into two groups: radiologists and non-radiologist physicians (internal medicine, family medicine, and emergency medicine physicians). The "overall" AUCs for radiologists and non-radiologist physicians were found by aggregating the data per physician group from all categories and all cases by micro-averaging78. To compare the AUCs of radiologists and non-radiologist physicians in the unaided and aided settings, we performed the DeLong test82 and evaluated the p-values against a Bonferroni-adjusted alpha level of 0.025 (0.05/2). We again calculated the relative reduction in miss rate for radiologists and non-radiologist physicians, respectively, by subtracting each group's aided miss rate (1 - sensitivity) from their unaided miss rate and then dividing by their unaided miss rate. To evaluate whether the AI system significantly improved case read times, we first took the log transform of the read times to approximate a normal distribution and then performed a paired t-test comparing log read times in the unaided and aided settings for radiologists and non-radiologist physicians, respectively.
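The read-time comparison can be sketched as follows with SciPy; the array arguments are assumed to be paired per case and reader, and the DeLong AUC comparison is not shown here because SciPy does not provide it.

```python
import numpy as np
from scipy import stats

def compare_read_times(unaided_seconds, aided_seconds):
    """Paired t-test on log-transformed per-case read times for one physician group.

    Inputs must be paired (same cases and readers, same order). The DeLong AUC
    comparison was carried out separately and evaluated against a
    Bonferroni-adjusted alpha of 0.025.
    """
    log_unaided = np.log(np.asarray(unaided_seconds, dtype=float))
    log_aided = np.log(np.asarray(aided_seconds, dtype=float))
    t_stat, p_value = stats.ttest_rel(log_unaided, log_aided)
    return t_stat, p_value
```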
Ethics approval
IRB approval was obtained from the New England IRB (study #1283806).
Data availability
The 1,000 randomly selected cases from the NIH ChestX-ray8 dataset and the corresponding labels are available upon reasonable request. The output of the model and the reference standard labels used to calculate the standalone results in this study are available upon reasonable request. The X-ray images used in the standalone and physician evaluations are not publicly available under Imagen Technologies' license.
References
de Groot, P. M., Carter, B. W., Abbott, G. F. & Wu, C. C. Pitfalls in chest radiographic interpretation: blind spots. Semin. Roentgenol. 50, 197–209 (2015).
Berlin, L. Accuracy of diagnostic procedures: has it improved over the past five decades? Am. J. Roentgenol. 188, 1173–1178 (2007).
Brady, A. P. Error and discrepancy in radiology: inevitable or avoidable? Insights Imaging 8, 171–182 (2017).
Schaffer, A. C. et al. Rates and characteristics of paid malpractice claims among US physicians by specialty, 1992–2014. JAMA Intern. Med. 177, 710–718 (2017).
Itri, J. N., Tappouni, R. R., McEachern, R. O., Pesch, A. J. & Patel, S. H. Fundamentals of diagnostic error in imaging. RadioGraphics 38, 1845–1865 (2018).
Klein, J. S. & Rosado-de-Christenson, M. L. A systematic approach to chest radiographic analysis. In Diseases of the Chest, Breast, Heart and Vessels 2019–2022: Diagnostic and Interventional Imaging (eds Hodler, J. & Kubik-Huch, R. A.) (Springer, 2019).
Van De Luecht, M. & Reed, W. M. The cognitive and perceptual processes that affect observer performance in lung cancer detection: a scoping review. J. Med. Radiat. Sci. 68, 175–185 (2021).
Kane, T. P. C., Nuttall, M. C., Bowyer, R. C. & Patel, V. Failure of detection of pneumothorax on initial chest radiograph. Emerg. Med. J. 19, 468 (2002).
Houck, P. M., Bratzler, D. W., Nsa, W., Ma, A. & Bartlett, J. G. Timing of antibiotic administration and outcomes for Medicare patients hospitalized with community-acquired pneumonia. Arch. Intern. Med. 164, 637–644 (2004).
Berlin, L. Defending the "missed" radiographic diagnosis. AJR Am. J. Roentgenol. 176, 317–322 (2001).
Quekel, L. G. B. A., Kessels, A. G. H., Goei, R. & van Engelshoven, J. M. A. Miss rate of lung cancer on the chest radiograph in clinical practice. Chest 115, 720–724 (1999).
Baker, S. R., Patel, R. H., Yang, L., Lelkes, V. M. & Castro, A. I. Malpractice suits in chest radiology: an evaluation of the histories of 8265 radiologists. J. Thorac. Imaging 28, 388–391 (2013).
Stec, N., Arje, D., Moody, A. R., Krupinski, E. A. & Tyrrell, P. N. A systematic review of fatigue in radiology: is it a problem? Am. J. Roentgenol. 210, 799–806 (2018).
Griffith, B., Kadom, N. & Straus, C. M. Radiology education in the 21st century: threats and opportunities. J. Am. Coll. Radiol. 16, 1482–1487 (2019).
Kadom, N., Norbash, A. & Duszak, R. Matching imaging services to clinical context: why less may be more. J. Am. Coll. Radiol. 18, 154–160 (2021).
Bhargavan, M., Kaye, A. H., Forman, H. P. & Sunshine, J. H. Workload of radiologists in United States in 2006–2007 and trends since 1991–1992. Radiology 252, 458–467 (2009).
Lee, C. S., Nagy, P. G., Weaver, S. J. & Newman-Toker, D. E. Cognitive and system factors contributing to diagnostic errors in radiology. AJR Am. J. Roentgenol. 201, 611–617 (2013).
Berbaum, K. S. et al. Satisfaction of search in chest radiography 2015. Acad. Radiol. 22, 1457–1465 (2015).
Bruno, M. A., Walker, E. A. & Abujudeh, H. H. Understanding and confronting our mistakes: the epidemiology of error in radiology and strategies for error reduction. RadioGraphics 35, 1668–1676 (2015).
Drew, T., Vo, M. L. H. & Wolfe, J. M. The invisible gorilla strikes again: sustained inattentional blindness in expert observers. Psychol. Sci. 24, 1848–1853 (2013).
Chan, D. C., Gentzkow, M. & Yu, C. Selection with variation in diagnostic skill: evidence from radiologists. Q. J. Econ. 137, 729–783 (2022).
Blazar, E., Mitchell, D. & Townzen, J. D. Radiology training in emergency medicine residency as a predictor of confidence in an attending. Cureus 12, e6615 (2020).
Schiller, P. T., Phillips, A. W. & Straus, C. M. Radiology education in medical school and residency: the views and needs of program directors. Acad. Radiol. 25, 1333–1343 (2018).
Zwaan, L., Kok, E. M. & van der Gijp, A. Radiology education: a radiology curriculum for all medical students? Diagn. Berl. Ger. 4, 185–189 (2017).
Saha, A., Roland, R. A., Hartman, M. S. & Daffner, R. H. Radiology medical student education: an outcome-based survey of PGY-1 residents. Acad. Radiol. 20, 284–289 (2013).
McLauchlan, C. A., Jones, K. & Guly, H. R. Interpretation of trauma radiographs by junior doctors in accident and emergency departments: a cause for concern? J. Accid. Emerg. Med. 14, 295–298 (1997).
Gatt, M. E., Spectre, G., Paltiel, O., Hiller, N. & Stalnikowicz, R. Chest radiographs in the emergency department: is the radiologist really necessary? Postgrad. Med. J. 79, 214–217 (2003).
Eng, J. et al. Interpretation of emergency department radiographs. Am. J. Roentgenol. 175, 1233–1238 (2000).
Atsina, K. B., Parker, L., Rao, V. M. & Levin, D. C. Advanced imaging interpretation by radiologists and nonradiologist physicians: a training issue. Am. J. Roentgenol. 214, W55–W61 (2020).
Guly, H. Diagnostic errors in an accident and emergency department. Emerg. Med. J. 18, 263–269 (2001).
Erickson, B. J., Korfiatis, P., Akkus, Z. & Kline, T. L. Machine learning for medical imaging. RadioGraphics 37, 505–515 (2017).
Çallı, E., Sogancioglu, E., van Ginneken, B., van Leeuwen, K. G. & Murphy, K. Deep learning for chest X-ray analysis: a survey. Med. Image Anal. 72, 102125 (2021).
Tang, Y. X. et al. Automated abnormality classification of chest radiographs using deep convolutional neural networks. NPJ Digit. Med. 3, 1–8 (2020).
Rajpurkar, P. et al. CheXNet: radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv preprint arXiv:1711.05225 (2017).
Rajpurkar, P. et al. Deep learning for chest radiograph diagnosis: a retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLoS Med. 15, e1002686 (2018).
Wu, J. T. et al. Comparison of chest radiograph interpretations by artificial intelligence algorithm vs radiology residents. JAMA Netw. Open 3, e2022779 (2020).
Murphy, K. et al. Computer aided detection of tuberculosis on chest radiographs: an evaluation of the CAD4TB v6 system. Sci. Rep. 10, 5492 (2020).
Baltruschat, I. M., Nickisch, H., Grass, M., Knopp, T. & Saalbach, A. Comparison of deep learning approaches for multi-label chest X-ray classification. Sci. Rep. 9, 6381 (2019).
Chouhan, V. et al. A novel transfer learning based approach for pneumonia detection in chest X-ray images. Appl. Sci. 10, 559 (2020).
Taylor, A. G., Mielke, C. & Mongan, J. Automated detection of moderate and large pneumothorax on frontal chest X-rays using deep convolutional neural networks: a retrospective study. PLoS Med. 15, e1002697 (2018).
Kim, C. et al. Multicentre external validation of a commercial artificial intelligence software to analyse chest radiographs in health screening environments with low disease prevalence. Eur. Radiol. 33, 3501–3509 (2023).
Rahman, T. et al. Reliable tuberculosis detection using chest X-ray with deep learning, segmentation and visualization. IEEE Access 8, 191586–191601 (2020).
Wang, H., Jia, H., Lu, L. & Xia, Y. Thorax-Net: an attention regularized deep neural network for classification of thoracic diseases on chest radiography. IEEE J. Biomed. Health Inform. 24, 475–485 (2020).
Cicero, M. et al. Training and validating a deep convolutional neural network for computer-aided detection and classification of abnormalities on frontal chest radiographs. Invest. Radiol. 52, 281–287 (2017).
Nam, J. G. et al. Development and validation of deep learning-based automatic detection algorithm for malignant pulmonary nodules on chest radiographs. Radiology 290, 218–228 (2019).
Hwang, E. J. et al. Deep learning for chest radiograph diagnosis in the emergency department. Radiology 293, 573–580 (2019).
Rajpurkar, P. et al. CheXaid: deep learning assistance for physician diagnosis of tuberculosis using chest X-rays in patients with HIV. NPJ Digit. Med. 3, 1–8 (2020).
Homayounieh, F. et al. An artificial intelligence-based chest X-ray model on human nodule detection accuracy from a multicenter study. JAMA Netw. Open 4, e2141096 (2021).
Ohlmann-Knafo, S. et al. AI-based software for lung nodule detection in chest X-rays -- Time for a second reader approach? arXiv preprint arXiv:2206.10912 (2022).
Yoo, H. et al. AI-based improvement in lung cancer detection on chest radiographs: results of a multi-reader study in NLST dataset. Eur. Radiol. 31, 9664–9674 (2021).
Kim, J. H. et al. Clinical validation of a deep learning algorithm for detection of pneumonia on chest radiographs in emergency department patients with acute febrile respiratory illness. J. Clin. Med. 9, 1981 (2020).
Benjamens, S., Dhunnoo, P. & Meskó, B. The state of artificial intelligence-based FDA-approved medical devices and algorithms: an online database. NPJ Digit. Med. 3, 1–8 (2020).
Seah, J. C. Y. et al. Effect of a comprehensive deep-learning model on the accuracy of chest x-ray interpretation by radiologists: a retrospective, multireader multicase study. Lancet Digit. Health 3, e496–e506 (2021).
Rajpurkar, P., Chen, E., Banerjee, O. & Topol, E. J. AI in health and medicine. Nat. Med. 28, 31–38 (2022).
Cutillo, C. M. et al. Machine intelligence in healthcare – perspectives on trustworthiness, explainability, usability, and transparency. NPJ Digit. Med. 3, 1–5 (2020).
U.S. Food & Drug Administration. K210666. https://www.accessdata.fda.gov/cdrh_docs/pdf21/K210666.pdf (2021).
RadReport. https://radreport.org/home (2021).
Hashimoto, R. et al. Artificial intelligence using convolutional neural networks for real-time detection of early esophageal neoplasia in Barrett's esophagus (with video). Gastrointest. Endosc. 91, 1264–1271 (2020).
Wang, X. et al. ChestX-ray8: hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR) 3462–3471 (2017).
Lindsey, R. et al. Deep neural network improves fracture detection by clinicians. Proc. Natl. Acad. Sci. 115, 11591–11596 (2018).
Jones, R. M. et al. Assessment of a deep-learning system for fracture detection in musculoskeletal radiographs. NPJ Digit. Med. 3, 144 (2020).
Horng, S. et al. Deep learning to quantify pulmonary edema in chest radiographs. Radiol. Artif. Intell. 3, e190228 (2021).
Elkin, P. L. et al. NLP-based identification of pneumonia cases from free-text radiological reports. AMIA Annu. Symp. Proc. 2008, 172–176 (2008).
Johnson, A. E. W. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 6, 317 (2019).
Seyyed-Kalantari, L., Zhang, H., McDermott, M. B. A., Chen, I. Y. & Ghassemi, M. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat. Med. 27, 2176–2182 (2021).
Oakden-Rayner, L. Exploring large-scale public medical image datasets. Acad. Radiol. 27, 106–112 (2020).
Irvin, J. et al. CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. Proc. AAAI Conf. Artif. Intell. 33, 590–597 (2019).
Krause, J. et al. Grader variability and the importance of reference standards for evaluating machine learning models for diabetic retinopathy. Ophthalmology 125, 1264–1272 (2018).
Jain, S., Smit, A., Ng, A. Y. & Rajpurkar, P. Effect of radiology report labeler quality on deep learning models for chest X-ray interpretation. arXiv preprint arXiv:2104.00793 (2021).
Jain, S. et al. VisualCheXbert: addressing the discrepancy between radiology report labels and image labels. In Proceedings of the Conference on Health, Inference, and Learning 105–115 (2021).
Rosenkrantz, A. B., Hughes, D. R. & Duszak, R. The U.S. radiologist workforce: an analysis of temporal and geographic variation by using large national datasets. Radiology 279, 175–184 (2016).
Rosenkrantz, A. B., Wang, W., Hughes, D. R. & Duszak, R. A county-level analysis of the US radiologist workforce: physician supply and subspecialty characteristics. J. Am. Coll. Radiol. 15, 601–606 (2018).
Friedberg, E. B. et al. Access to interventional radiology services in small hospitals and rural communities: an ACR membership intercommission survey. J. Am. Coll. Radiol. 16, 185–193 (2019).
Review of an Alleged Radiology Exam Backlog at the W.G. (Bill) Hefner VA Medical Center in Salisbury, NC. https://www.oversight.gov/report/va/review-alleged-radiology-exam-backlog-wg-bill-hefner-vamc-salisbury-nc (2016).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (2016).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2017).
U.S. Food and Drug Administration, Center for Devices and Radiological Health. Recommended content and format of non-clinical bench performance testing information in premarket submissions. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/recommended-content-and-format-non-clinical-bench-performance-testing-information-premarket (2020).
Sokolova, M. & Lapalme, G. A systematic analysis of performance measures for classification tasks. Inf. Process. Manag. 45, 427–437 (2009).
U.S. Food and Drug Administration, Center for Devices and Radiological Health. Clinical performance assessment: considerations for computer-assisted detection devices applied to radiology images and radiology device data in premarket notification (510(k)) submissions. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/clinical-performance-assessment-considerations-computer-assisted-detection-devices-applied-radiology (2020).
Dorfman, D. D., Berbaum, K. S. & Metz, C. E. Receiver operating characteristic rating analysis: generalization to the population of readers and patients with the jackknife method. Invest. Radiol. 27, 723–731 (1992).
Hillis, S. L., Berbaum, K. S. & Metz, C. E. Recent developments in the Dorfman-Berbaum-Metz procedure for multireader ROC study analysis. Acad. Radiol. 15, 647–661 (2008).
DeLong, E. R., DeLong, D. M. & Clarke-Pearson, D. L. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44, 837–845 (1988).
Acknowledgements
We would like to acknowledge Quartesian, LLC for conducting the primary statistical analyses and Cooper Marshall and Matthew Pillari for assisting in the creation of the figures.
Author information
Authors and Affiliations
Contributions
P.G.A., S.V., S.S., R.V.L., and R.M.J. contributed to the conception of the project. P.G.A., D.L.L., S.V., S.S., R.V.L., and R.M.J. contributed to data collection. P.G.A., H.T-S., M.A., N.K., S.V., E.B., S.S., R.V.L., and R.M.J. contributed to the implementation of the project and the data analysis. P.G.A., H.T-S., M.A., N.K., D.L.L., S.V., E.B., S.S., S.H., R.V.L., and R.M.J. contributed to revising the article and the data analysis. S.V., E.B., and R.V.L. contributed to the design and implementation of the deep-learning models. P.G.A., H.T-S., M.A., N.K., S.V., E.B., and R.M.J. contributed to overseeing data analysis and interpretation and drafted the article. All authors approved the completed manuscript.
Corresponding author
Ethics declarations
HIPAA compliance
All Protected Health Information used in the training and validation of this deep learning system was de-identified in compliance with the Expert Determination method of the Health Insurance Portability and Accountability Act of 1996 (HIPAA). The study complied with all relevant ethical regulations, and a patient waiver of consent was granted by the New England Independent Review Board because the study presented no risk to patients.
Competing interests
The authors declare the following financial competing interests: Financial support for the research was provided by Imagen Technologies, Inc. P.G.A., H.T-S., M.A., N.K., S.V., E.B., S.S., S.H., R.V.L., and R.M.J. are employees or were employees of Imagen Technologies, Inc. when the research was conducted and the manuscript was drafted. P.G.A., M.A., N.K., D.L.L., S.V., E.B., S.S., S.H., R.V.L., and R.M.J. are shareholders at Imagen Technologies, Inc. The authors declare that there are no non-financial competing interests.
Additional information
Publisherâs note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Anderson, P.G., Tarder-Stoll, H., Alpaslan, M. et al. Deep learning improves physician accuracy in the comprehensive detection of abnormalities on chest X-rays. Sci Rep 14, 25151 (2024). https://doi.org/10.1038/s41598-024-76608-2