Rule-based outlier detection of AI-generated anatomy segmentations

Deepa Krishnaswamy, Brigham and Women’s Hospital, Boston, MA, USA
Vamsi Krishna Thiriveedhi, Brigham and Women’s Hospital, Boston, MA, USA
Cosmin Ciausu, Brigham and Women’s Hospital, Boston, MA, USA
David Clunie, PixelMed Publishing, Bangor, PA, USA
Steve Pieper, Isomics, Cambridge, MA, USA
Ron Kikinis, Brigham and Women’s Hospital, Boston, MA, USA
Andrey Fedorov, Brigham and Women’s Hospital, Boston, MA, USA

Abstract

There is a dire need for medical imaging datasets with accompanying annotations to perform downstream patient analysis. However, these annotations are difficult to generate manually, because the process is time-consuming and clinical conventions vary. Artificial intelligence has been adopted in the field as a potential method to annotate these large datasets; however, a lack of expert annotations or ground truth can inhibit the adoption of these annotations. We recently made publicly available a dataset that includes annotations and extracted features of up to 104 organs for the National Lung Screening Trial, generated using the TotalSegmentator method. However, the released dataset does not include expert-derived annotations or an assessment of the accuracy of the segmentations, limiting its usefulness. We propose heuristics to assess the quality of the segmentations, providing methods to measure the consistency of the annotations and a comparison of results to the literature. We make our code and related materials publicly available at https://github.com/ImagingDataCommons/CloudSegmentatorResults and interactive tools at https://huggingface.co/spaces/ImagingDataCommons/CloudSegmentatorResults.

1 Introduction

Availability of annotations is critical for secondary use of public imaging datasets. In the case of large-scale datasets, annotating them manually in a timely manner is not feasible. In recent years, artificial intelligence (AI) tools have shown promise in automatic annotation of both anatomy and pathology, for a variety of use cases. For instance, the TotalSegmentator model can annotate up to 118 anatomical structures in computed tomography (CT) volumes [1] and has recently been extended to magnetic resonance imaging (MRI) [2]. Other popular models include the nnU-Net framework [3] and its set of pre-trained models covering segmentation of the prostate, kidneys, cardiac substructures, and other regions, for multiple modalities. More recently, the SegVol [4] method provides a universal, interactive method for annotating medical images.

Despite the availability of robust pre-trained models, there are still many large publicly available datasets without sufficient annotations. One of the largest collections in the NCI Imaging Data Commons (IDC) repository [5] is data from the National Lung Screening Trial (NLST) [6]. The CT arm of this collection holds over 26K patients scanned over a period of three years, yielding over 200K CT volumes. Until recently, most of these were unlabeled, complicating their downstream use. Earlier, we used the TotalSegmentator pre-trained model to volumetrically segment the organs and anatomic structures in over 125K of those CT volumes, and extracted 28 radiomics features for each of the 9.5 million generated segmentations [7, 8]. The annotations generated by this AI-based curation effort are publicly available from IDC.

A practical challenge in the downstream use of AI-generated segmentations is the lack of certainty about their correctness. The NLST collection does not include ground truth segmentations. While selective visual checks can help build confidence in these annotations, with more than 9.5 million segmentations in total and a range of 23-96 segmented structures per individual scan, neither complete review by experts nor the creation of ground truth for a sample of this dataset is practical. Additionally, it would be beneficial to inform the user of segmentations that may be problematic. As our dataset contains annotations for over 125K CT volumes, it is virtually impossible to perform a manual review without the aid of efficient methods for detecting failures.

Several methods have been developed for the analysis of segmentation results without associated expert-derived or ground truth data. These can be divided into two main approaches: ones that focus on predicting overlap metrics [9, 10, 11], and ones that predict the error-specific regions of a segmentation mask [12, 13, 14]. However, all of these machine learning and deep learning approaches suffer from limitations. For instance, they may exhibit low performance when evaluated on data from a different domain than the training dataset. Additionally, all of these methods require training one or more models, which can be time-consuming.

We therefore propose simple heuristics that aim to help with detecting failures and assessing the performance of the volumetric segmentations. This approach does not require the use of machine learning and may be beneficial beyond the dataset evaluated herein. Given the AI-generated segmentations of over 9.5M structures across 125K+ CT volumes of the NLST collection, we evaluate the developed heuristics and provide interactive tools for visualizing and exploring analysis results, and for assessing the presented heuristics. As a surrogate for the effectiveness of those heuristics, we investigated whether removing the results deemed to be failures by the heuristics leads to improved consistency of 1) volumes of left vs right structures such as the ribs, 2) within-patient volumes of structures, and 3) vertebral volumes compared to a population in the literature. Additionally, to enable user interaction and exploration of the AI-generated segmentations, we propose multiple tools the user can utilize for benchmarking purposes. Finally, we make all of our code and interactive tools publicly available at https://github.com/ImagingDataCommons/CloudSegmentatorResults and https://huggingface.co/spaces/ImagingDataCommons/CloudSegmentatorResults.

Figure 1: Example segmentations generated from the TotalSegmentator AI model on a patient from the National Lung Screening Trial cohort, displayed using 3D Slicer [15]. The treemap summarizes anatomic structures segmented across the entire cohort.

2 Methodology

We developed four heuristics for analyzing AI-generated results and identifying problematic segmentations in the absence of expert annotations. Figure 1 displays an overview of the segmentations that we generate using TotalSegmentator, displayed using 3D Slicer [15]. These heuristics are in part based upon the DICOM metadata extracted from the segmentations presented in previous work [7, 8]. Please see the Supplementary materials for details concerning the metadata. We investigated the effect of this set of heuristics in three studies: 1. left vs right volumes of the ribs, 2. within-patient volumes, and 3. comparison of AI-generated vertebral volumes to a population study.

2.1 Heuristics

The four heuristics are described below. Except for the connected components check, each heuristic yields a true/false flag for each of the individual segments that need to be analyzed, with the "fail" value corresponding to a segment flagged as problematic by a given heuristic rule. We left the connected components count in numerical form to let the user adjust the threshold if needed.

1. Segmentation completeness: Depending on the inferior-to-superior extent of the axial CT scan, some anatomical structures may be included only partially. To remove these incomplete segmentations from further analysis, we evaluated the completeness of the segmentation by ensuring that there was at least one empty slice above and below the segment. Furthermore, we hypothesized that the segmentation might not be complete if it appears on the most inferior or superior transverse slices of the scan, indicating that it could extend beyond the scanned range, despite a theoretical possibility of a perfect alignment occurring on the terminal slices. We note that this heuristic can only indicate whether TotalSegmentator had the opportunity to segment a structure completely, but not the accuracy.
2. Connected component: Each anatomical region that is segmented volumetrically should be continuous and consist of a single connected component. However, using the pre-trained TotalSegmentator model with minimal post-processing can yield unconnected components for an anatomical region. Using the pyradiomics [16] package, the VoxelNum field provides the number of connected components, which in the ideal case should be one. We therefore used this field to identify segmentations with extraneous or noisy voxels.
3. Laterality: Segmentation algorithms may produce the incorrect laterality label (left vs right) of a region. To detect this, we evaluated the laterality by using metadata extracted from the segmentations using the pyradiomics [16] general feature CenterOfMass field. This attribute provides the center of mass of the region in the world-coordinate system. In this system, the coordinates increase from right to left; therefore, we can easily determine if the laterality of a paired structure is correct.
4. Minimum volume from voxel summation: To our knowledge, the adrenal gland is the smallest organ segmented by TotalSegmentator. The average volume according to [17] is approximately 5 mL. We used the pyradiomics feature Volume from Voxel Summation to discern volume and chose 5 mL as the threshold for all segmentations to remove artifacts.
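
The four rules above can be sketched directly on a binary mask. This is a minimal illustration rather than the code used in the study, which derives the flags from pyradiomics and DICOM metadata. The (slice, row, column) axis order, the helper names, the choice of 26-connectivity, and the comparison of paired centers of mass are all assumptions made for the example.

```python
from collections import deque
from itertools import product

import numpy as np


def is_complete(mask):
    """True if the segment leaves the first and last axial slices empty.

    Assumes axis 0 of `mask` is the inferior-superior (slice) axis.
    """
    occupied = np.flatnonzero(mask.any(axis=(1, 2)))
    if occupied.size == 0:
        return False  # nothing was segmented
    return bool(occupied[0] > 0 and occupied[-1] < mask.shape[0] - 1)


def count_components(mask):
    """Count 26-connected components with a breadth-first flood fill."""
    offsets = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]
    seen = np.zeros(mask.shape, dtype=bool)
    n_components = 0
    for seed in zip(*np.nonzero(mask)):
        if seen[seed]:
            continue
        n_components += 1
        seen[seed] = True
        queue = deque([seed])
        while queue:
            z, y, x = queue.popleft()
            for dz, dy, dx in offsets:
                nb = (z + dz, y + dy, x + dx)
                inside = all(0 <= c < s for c, s in zip(nb, mask.shape))
                if inside and mask[nb] and not seen[nb]:
                    seen[nb] = True
                    queue.append(nb)
    return n_components


def laterality_ok(left_com_x, right_com_x):
    """True if the 'left' structure lies further along the right-to-left axis.

    Assumes both values are center-of-mass coordinates on an axis that
    increases from patient right to patient left.
    """
    return left_com_x > right_com_x


def volume_ml(mask, spacing_mm):
    """Volume from voxel summation: voxel count times voxel volume, in mL."""
    voxel_mm3 = float(np.prod(spacing_mm))
    return int(mask.sum()) * voxel_mm3 / 1000.0


def passes_min_volume(mask, spacing_mm, threshold_ml=5.0):
    """True if the segment volume is at least `threshold_ml`."""
    return volume_ml(mask, spacing_mm) >= threshold_ml
```

Keeping `count_components` numeric, as the text does, lets a user choose a more lenient threshold than exactly one component.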

2.2 Interactive visualization

To facilitate analysis of the segmentation results, we developed an interactive dashboard based on the Streamlit framework (https://github.com/streamlit/streamlit) and hosted it on the free tier of Hugging Face Spaces (https://huggingface.co/spaces/ImagingDataCommons/CloudSegmentatorResults). The dashboard consists of two pages. The "Summary" page contains the results of applying each of the four heuristics to each of the segments. The table can be sorted to quickly gain insights into which of the segmentations were flagged as outliers. The "Plots" page features two types of plots and includes filtering options for radiomics feature, anatomical structure, laterality, and the four heuristics. We display UpSet plots, which show how many segments passed or failed the heuristics, and in what combinations. Additionally, we display violin plots, which show the distributions of the within-patient standard deviation of radiomics features before and after applying the heuristics. This helps in studying the effect of the heuristics on the consistency of a radiomics feature value distribution for a given anatomical structure within a patient. Both plots are updated dynamically depending on the choice of filters.
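
The counts behind an UpSet plot are simply the number of segments for each exact combination of passed heuristics. The sketch below shows that tally; it is not the dashboard code, and the flag names and toy records are hypothetical.

```python
from collections import Counter

# Hypothetical flag names, one per heuristic.
HEURISTICS = ("complete", "single_component", "laterality", "min_volume")


def combination_counts(segments):
    """Count segments by the exact set of heuristics they pass."""
    counts = Counter()
    for flags in segments:
        passed = tuple(name for name in HEURISTICS if flags[name])
        counts[passed] += 1
    return counts


# Toy records for illustration only; these are not study results.
toy_segments = [
    {"complete": True, "single_component": True, "laterality": True, "min_volume": True},
    {"complete": True, "single_component": True, "laterality": True, "min_volume": True},
    {"complete": False, "single_component": True, "laterality": True, "min_volume": True},
    {"complete": True, "single_component": False, "laterality": True, "min_volume": False},
]

for combo, n in combination_counts(toy_segments).most_common():
    print(n, combo if combo else "(none passed)")
```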

2.3 Left vs right volumes of the ribs

The ribs comprise a large portion of the regions segmented by TotalSegmentator, accounting for 24 out of the total 104 segments. However, the ribs may suffer from a number of problems that could result in inaccurate segmentations. First, they are relatively small structures. Second, training of the TotalSegmentator method utilized data from the RibFrac 2020 challenge [18] (https://ribfrac.grand-challenge.org/dataset), which provided a single segmentation of the ribs that was then post-processed by the developers of TotalSegmentator into 24 individual segments. Lastly, the portion of the ribs close to the vertebrae was excluded in the original segmentation. Due to these potential issues and the symmetry of the ribs, we chose to focus on the consistency of the left vs right rib segmentation.

For each pair of left vs right ribs, we computed the normalized difference of the volume: (left - right)/(left + right). The differences were computed for four sets of volumes: the original before any filtering is applied, after the segmentation completeness check is applied, after the single connected component check is applied, and after the laterality heuristic is applied. Multiple linear mixed-effects models were used to assess the effect of the addition of each heuristic. In all cases, the patient ID was included as a random effect, to account for the fact that a patient is scanned multiple times.
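
As a worked example of the normalized difference, the short function below evaluates (left - right)/(left + right) for one rib pair. The function name and the example volumes are illustrative assumptions, not values from the study.

```python
def normalized_difference(left_ml, right_ml):
    """Return (left - right) / (left + right) for a pair of volumes.

    The result lies in [-1, 1] for non-negative volumes and is 0 for a
    perfectly symmetric pair.
    """
    total = left_ml + right_ml
    if total == 0:
        raise ValueError("both volumes are zero; the difference is undefined")
    return (left_ml - right_ml) / total


# Hypothetical pair: an 11 mL left rib and a 9 mL right rib
# give (11 - 9) / (11 + 9) = 2 / 20.
print(normalized_difference(11.0, 9.0))
```

Dividing by the sum makes the measure independent of rib size, so small upper ribs and large middle ribs can be compared on the same scale.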

2.4 Within-patient volumes

Each patient from the NLST study was scanned three times over three years. One or more scans were acquired within each time point (known as a study). Within each study, various convolution kernels were used for reconstruction of the CT scan, in order to enhance and visualize different parts of the chest. According to the NLST protocol [6], a single helical scan was obtained from a patient, and two or three axial reconstructions were performed. The patient was scanned until a satisfactory scan was obtained. Since each patient is scanned over three years, we expect some variability in the organs. Therefore, we chose to study the volumes of vertebrae before and after applying the heuristics, to see if there are significant differences between the distributions.

2.5 Comparison of AI-generated vertebral volumes to a population study

We chose to study the vertebrae, as the vertebral segments account for approximately 23% of the possible structures segmented by TotalSegmentator. The volume of the vertebrae in particular is a useful measure in a number of different areas. For instance, in osteoporosis, a disease that causes reduced bone mass and vertebral fractures, the volume of the vertebrae may be used to monitor the progression of the disease [19]. We compared our observations with the study by Limthongkul et al. [20] of the volume of the lumbar and thoracic vertebrae. In that study, CT scans were performed on 40 patients (even distribution of men and women) and the BrainLab software (Munich, Germany) was used to calculate the vertebral volumes. Both interobserver and intraobserver studies were performed to confirm the reliability of the volume measurements.

3 Results

3.1 Heuristics and interactive visualization

The "Summary" page of the dashboard (https://huggingface.co/spaces/ImagingDataCommons/CloudSegmentatorResults) reveals how many segments corresponding to the individual anatomic structures passed each of the heuristics when applied independently.

1. Segmentation completeness: Given that the region of interest in the NLST scans is the chest area, most organs within the thoracic region were segmented completely. The pass rate was highest for organs located in the middle of the thoracic region and gradually decreased for organs located towards the outer regions. Only 32% (34/104) of the total organs segmented by TotalSegmentator passed the segmentation completeness check in over 90% of the cumulative total number of segmentations. These successfully segmented organs included 8 thoracic vertebrae (T3-T10), 14 ribs (two through eight on each side), all 5 lung lobes, all 4 heart chambers (atria and ventricles), the pulmonary artery, and the myocardium.
2. Laterality: The laterality test was performed for all paired organs, a total of 56 organs. TotalSegmentator assigned laterality exceptionally well, with nearly 100% accuracy in 75% (42/56) of the organs segmented, and over 95% accuracy in 98% (55/56) of the organs.
3. Connected components: TotalSegmentator uses nnU-Net as the base algorithm, performing segmentations volumetrically. Ideally, we expect a single connected volume per segmentation. All four heart chambers, the pulmonary artery, aorta, myocardium, and deep back muscles had more than 98% of segmentations with a single connected volume. The portal and splenic vein segment had the lowest proportion of single volumes, with only 13% (16385/124329) of segmentations being single volumes. All vertebrae except for T2, T3, and T7-T12 had less than 80% of segmentations with a single volume. Specifically, C7, L3, and L4 had less than 50% of segmentations with a single volume.
4. Minimum volume from voxel summation: We found that 58% (60/104) of the organs had 90% or more segmentations with a volume of at least 5 mL. As anticipated, the majority of these organs were located in the thoracic region. However, it is noteworthy that the eleventh and twelfth ribs, all lumbar vertebrae (with the exception of L1), and all cervical vertebrae had less than 90% of their segmentations meeting the minimum volume threshold. All organs below the abdominal area failed the threshold as well.
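
Summaries of the kind reported in this list reduce to a per-structure pass rate followed by a cutoff. The following sketch assumes one record per segment with a hypothetical `structure` key and one boolean key per heuristic; the toy records are illustrative and are not study data.

```python
from collections import defaultdict


def pass_rates(records, heuristic):
    """Fraction of segments of each structure that pass one heuristic."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["structure"]] += 1
        passed[rec["structure"]] += bool(rec[heuristic])
    return {name: passed[name] / total[name] for name in total}


def structures_above(rates, cutoff=0.9):
    """Structures whose pass rate is strictly greater than `cutoff`."""
    return sorted(name for name, rate in rates.items() if rate > cutoff)


# Toy records for illustration only.
toy_records = (
    [{"structure": "heart_atrium_left", "complete": True}] * 19
    + [{"structure": "heart_atrium_left", "complete": False}]
    + [{"structure": "kidney_right", "complete": True}] * 3
    + [{"structure": "kidney_right", "complete": False}] * 7
)

rates = pass_rates(toy_records, "complete")
print(structures_above(rates))
```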

In the following, we perform a more detailed analysis of the results to evaluate whether the heuristics have an effect on quantitative feature analysis. Figure 2 displays an example visualization from the dashboard.

3.2 Left vs right volumes of the ribs

We first investigated the effect of applying multiple heuristics to aid in the filtering of series of interest, as demonstrated in Figure 3. We label the heuristics as follows: A = segmentation completeness check, B = number of connected components = 1, C = volume > 5 mL, D = laterality check, and apply them successively. For each rib, we calculate the normalized difference between the left and the right rib by computing (left - right)/(left + right). In the figure, we plot the mean normalized difference for each rib pair, and a line connecting the mean ± one standard deviation. We observe that the application of each successive heuristic usually decreases the mean normalized difference, which is the desired behavior, as we only want to include rib pairs that are symmetric and have a similar volume between the left and the right. We also observe that for most ribs, the application of successive heuristics lowers the standard deviation of the normalized difference, which likewise indicates that outliers are removed by this process. We performed statistical testing to evaluate the effect of applying each of the successive heuristics, and to see if there was a significant difference between the original data and the data after applying all of the heuristics. Linear mixed-effects modeling was performed between each set of data and the set obtained after applying a successive heuristic, with the patient treated as a random effect to account for repeated measures. Detailed results can be seen in the supplementary materials, where we observed a significant effect of the segmentation completeness heuristic.
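
The successive filtering summarized in Figure 3 can be sketched as a running filter over rib pairs, recording the mean and standard deviation of the normalized difference after each step. This is an illustrative sketch only: the record layout, the assumption that a flag is true when both ribs of a pair pass, and the toy values are not taken from the study.

```python
from statistics import mean, stdev


def successive_summary(pairs, order=("A", "B", "C", "D")):
    """Apply heuristics one after another and summarize what remains.

    Each pair is a dict holding 'norm_diff' and one boolean per heuristic.
    Returns rows of (label, n_pairs, mean, standard deviation).
    """
    kept = list(pairs)
    applied = []
    rows = []
    for name in (None, *order):
        if name is not None:
            kept = [p for p in kept if p[name]]
            applied.append(name)
        diffs = [p["norm_diff"] for p in kept]
        rows.append((
            "+".join(applied) or "original",
            len(diffs),
            mean(diffs) if diffs else None,
            stdev(diffs) if len(diffs) > 1 else None,
        ))
    return rows


# Toy rib pairs for illustration only.
toy_pairs = [
    {"norm_diff": 0.0, "A": True, "B": True, "C": True, "D": True},
    {"norm_diff": 0.1, "A": True, "B": True, "C": True, "D": True},
    {"norm_diff": -0.1, "A": True, "B": False, "C": True, "D": True},
    {"norm_diff": 0.8, "A": False, "B": True, "C": True, "D": True},
]

for row in successive_summary(toy_pairs):
    print(row)
```

In the toy data, the pair with the largest asymmetry fails the completeness check, so the spread shrinks after step A, which mirrors the qualitative pattern described above.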

There are cases where the heuristics worked as expected and removed problematic series. For instance, in Figure 4A, the heuristic correctly identified segmentations that were incomplete, as can be seen for the 12th rib. As the scan does not fully cover the 12th rib, this series was removed from further analysis. However, there are cases where the heuristics did not work as expected. In Figure 4B, we can see an example of one of the series that was not filtered out. The 12th rib is incorrectly labeled, as part of a vertebra is likely segmented instead. Additionally, it can be seen that the 11th rib is oversegmented and incorrectly labeled, as part of it should be the 12th rib.

3.3 Within-patient volumes

Figure 5 demonstrates the within-patient consistency of the right kidney. For each patient, we compute the standard deviation of the volumes, and plot the distribution of these standard deviations. We compare the original volumes (no heuristics applied) to the distribution of volumes after all heuristics have been applied. We observe a lower standard deviation of volumes after filtering, as well as a larger concentration around the lower median, indicating that filtering likely removed some problematic series. However, the heuristics do not remove all outliers, as observed in Figure 5.
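
The within-patient comparison can be sketched by grouping volumes by patient and taking the standard deviation with and without the filter. The field names, the `passes_all` flag, and the toy volumes below are hypothetical and serve only to show the computation.

```python
from collections import defaultdict
from statistics import stdev


def within_patient_sd(rows, keep=None):
    """Standard deviation of volumes per patient.

    `keep` is an optional predicate selecting the rows to retain; patients
    with fewer than two retained volumes are omitted.
    """
    volumes = defaultdict(list)
    for row in rows:
        if keep is None or keep(row):
            volumes[row["patient_id"]].append(row["volume_ml"])
    return {pid: stdev(v) for pid, v in volumes.items() if len(v) >= 2}


# Toy measurements for illustration only.
toy_rows = [
    {"patient_id": "p1", "volume_ml": 150.0, "passes_all": True},
    {"patient_id": "p1", "volume_ml": 152.0, "passes_all": True},
    {"patient_id": "p1", "volume_ml": 20.0, "passes_all": False},
    {"patient_id": "p2", "volume_ml": 140.0, "passes_all": True},
]

before = within_patient_sd(toy_rows)
after = within_patient_sd(toy_rows, keep=lambda row: row["passes_all"])
print(before, after)
```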

3.4 Comparison of AI-generated vertebral volumes to a population study

In Figure 6 we plot on the left the distribution of the number of series per vertebra. We observe that there is a significant drop off in the number of series for the cervical and lumbar vertebrae, which is appropriate as the NLST cohort is for imaging the chest where the thoracic vertebrae are located. Therefore, we focus on extracting the volume features of only the thoracic vertebrae, as seen on the right. Here we can observe the distribution of volumes after applying the heuristics, and an increase in the median volume as one goes from superior to inferior, which agrees with the literature [20].

We also compared the volume values after the four heuristics to the literature, as seen in Figure 7. We observe a similar trend in the thoracic vertebrae for our population vs the paper for both males and females. However, we do observe a large shift in the volumes from our approach vs that of the paper. This is due to the fact that the study from the paper measured the volume of the vertebral body, while our method measured the volume of the entire vertebra, which consists of both the vertebral body (anterior arch) and the posterior arch.

4 Discussion

We have provided a set of heuristics that can be applied to help identify problematic segmentations in large datasets. With our dataset consisting of over 125K CT volumes and up to 104 possible regions segmented in each of those CT volumes, the task of manually evaluating the segmentations and identifying failures is next to impossible. In this manuscript we have demonstrated not only how one can identify failures in the segmentation, but also how one can quickly summarize datasets and annotations and perform a comparison to the literature. Overall, our provided dataset and annotations enable further exploration beyond the use cases demonstrated in this paper. Additionally, we make our code and data available in order to encourage reproducibility, and demonstrate how the same qualitative and quantitative heuristics can be used for analysis of any similar data.

There are several applications where the developed heuristics can have immediate impact. First, by detecting the problematic cases we can reduce the noise in the data for studies that are relying on segmentation-derived measurements. Second, by eliminating failed segmentations, the burden of manual review of the AI-generated segmentations can be reduced. This preliminary evaluation ensures that only the most accurate and reliable segmentations are forwarded for manual review.

The heuristics that we developed do suffer from a few limitations. The segmentation completeness heuristic requires an unsegmented slice above and below the segmented region. Segmentation of a single voxel with the empty slices above and below would be qualified as complete. Additionally, the segmentation completeness check may only be useful for certain organs, for instance ones that are more localized in the inferior-superior direction. For organs such as the vessels or veins, or deep muscles of the back, that have a much larger extent in the inferior-superior direction, this type of heuristic may not be beneficial. For the laterality check, we observed that this heuristic did not make a significant difference in assessing failures of the segmentations. However, this is likely due to the robustness and high performance of the TotalSegmentator algorithm. For other algorithms that are either not as robust or are in development, this metric could be potentially useful.

The heuristic that treats a single connected component as ideal also has limitations. TotalSegmentator sometimes produces artifacts in many segments. For example, a single connected component could consist of just a single voxel and would not be flagged by the check. On the other hand, a segmentation with two components may be flagged as problematic even if the first component corresponds to a precise segmentation and the second component contains only a single voxel. Other limitations include the inability to identify cases where one segment is mislabeled as another, or where a segmentation is technically correct, but the entirety of the structure is not segmented. Finally, none of the heuristics we introduced are able to assess the alignment of the boundary of the segmentation with the boundary of the organ/structure in the image. These heuristics cannot replace quantitative assessment of the overlap with the expert segmentations.

There are also some limitations in the analysis that we performed. In the evaluation of the vertebral volume distributions we compared our results with those by Limthongkul et al. [20], which reported the distribution of the volumes of the vertebral body (anterior arch). Our analysis is based on the entire vertebra segmentation, which consists of both the vertebral body (anterior arch) and the posterior arch. Therefore, we could not perform a direct comparison with the results of that study. Another limitation of our analyses was the issue of repeated measures. Each patient was scanned multiple times (multiple studies). Within each study, a patient could have both multiple series reconstructed from the same scan and series from new scans. Our statistical analysis did not account for all the hierarchical levels of repeated measures, and instead only considered the top-level patient-wise repeated measures. Additionally, we did not analyze how our heuristics perform with regard to gender and race, or whether these heuristics unfairly impact specific groups.

The dashboard we developed allows users to quickly filter for regions of interest and visualize the effect of applying different heuristics. However, with more time, there are additional features that would be useful. For instance, we currently can only select a single region at a time, along with a single feature. This limits the amount of comparison that the user can currently perform. Additionally, one may want to compare features for the left vs right structures, which is currently not supported. Improvements for filtering in the table by body part, for instance picking all rib regions, would be helpful.

There are numerous improvements that can be implemented. A portion of our heuristics are dependent on thresholds, such as the volume of the segmented area. Those could be replaced by structure specific values from the literature. Additionally, the heuristics could include additional radiomics features among those available. We would also like to perform more population studies for regions that are associated with the lungs and lung cancer, as those were not specifically a focus in the manuscript. We could also consider the development of AI or machine learning models to capture the uncertainty of the model and perform quality control, by leveraging the TotalSegmentator ground truth data for training. Additionally, the interactive Hugging Face dashboard could be improved in terms of comparing multiple regions and features, and including additional plots and more user-friendly filtering.

5 Conclusion

We have demonstrated the potential of simple heuristics to aid in the quality control and detection of failures in large, annotated datasets without access to ground truth. We have proposed the use of four heuristics that check whether a region is fully segmented, whether it consists of multiple components, whether its laterality is correct, and whether it meets the minimum volume from voxel summation. With these measures we have provided methods to easily interact with, summarize, and understand the data. Additionally, we have demonstrated three studies that can be performed with the use of these heuristics: the consistency of left vs right rib segmentations, the variability of a region within a patient, and a comparison to the literature. While the proposed approach cannot flag all of the problematic segmentation results and has limitations, it is relatively easy to apply and is effective in identifying some of the outliers. We hope the work presented here can stimulate future research into the challenging task of automating quality control for AI-generated analysis results.

Acknowledgments and Disclosure of Funding

This work was supported in part by NIH NCI under Task Order No. HHSN2611 0071 under Contract No. HHSN261201500003l. Our use of JetStream2 resources was supported by the ACCESS project 230025 [21, 22].

References

[1] J Wasserthal, HC Breit, MT Meyer, M Pradella, D Hinck, AW Sauter, T Heye, DT Boll, J Cyriac, S Yang, and M Bach. TotalSegmentator: Robust segmentation of 104 anatomic structures in CT images. Radiology: Artificial Intelligence, 5(5), 2023.
[2] TA D’Antonoli, LK Berger, AK Indrakanti, N Vishwanathan, J Weiß, M Jung, Z Berkarda, A Rau, M Reisert, T Küstner, and A Walter. TotalSegmentator MRI: Sequence-independent segmentation of 59 anatomical structures in MR images, 2024.
[3] F Isensee, PF Jaeger, SA Kohl, J Petersen, and KH Maier-Hein. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods, 18(2):203–211, 2021.
[4] Y Du, F Bai, T Huang, and B Zhao. SegVol: Universal and interactive volumetric medical image segmentation, 2023.
[5] A Fedorov, WJ Longabaugh, D Pot, DA Clunie, SD Pieper, DL Gibbs, C Bridge, MD Herrmann, A Homeyer, R Lewis, and HJ Aerts. National Cancer Institute Imaging Data Commons: Toward transparency, reproducibility, and scalability in imaging artificial intelligence. Radiographics, 43(12):e230180, 2023.
[6] National Lung Screening Trial Research Team. Reduced lung-cancer mortality with low-dose computed tomographic screening. New England Journal of Medicine, 365(5):395–409, 2011.
[7] VK Thiriveedhi, D Krishnaswamy, D Clunie, S Pieper, R Kikinis, and A Fedorov. Cloud-based large-scale curation of medical imaging data using AI segmentation, 2024.
[8] VK Thiriveedhi, D Krishnaswamy, D Clunie, and A Fedorov. TotalSegmentator segmentations and radiomics features for NCI Imaging Data Commons CT images, 2024. URL https://zenodo.org/records/8347012.
[9] T Kohlberger, V Singh, C Alvino, C Bahlmann, and L Grady. Evaluating segmentation error without ground truth. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 528–536, 2012.
[10] L Zhou, W Deng, and X Wu. Robust image segmentation quality assessment, 2019.
[11] X Wang, Q Zhang, Z Zhou, F Liu, Y Yu, Y Wang, and W Gao. Evaluating multi-class segmentation errors with anatomical priors. In 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), pages 953–956, 2020.
[12] FA Zaman, L Zhang, H Zhang, M Sonka, and X Wu. Segmentation quality assessment by automated detection of erroneous surface regions in medical images. Computers in Biology and Medicine, 164:107324, 2023.
[13] EG Henderson, AF Green, M van Herk, and EM Vasquez Osorio. Automatic identification of segmentation errors for radiotherapy using geometric learning. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 319–329, 2022.
[14] VV Valindria, I Lavdas, W Bai, K Kamnitsas, EO Aboagye, AG Rockall, D Rueckert, and B Glocker. Reverse classification accuracy: predicting segmentation performance in the absence of ground truth. IEEE Transactions on Medical Imaging, 36(8):1597–606, 2017.
[15] A Fedorov, R Beichel, J Kalpathy-Cramer, J Finet, JC Fillion-Robin, S Pujol, C Bauer, D Jennings, F Fennessy, M Sonka, and J Buatti. 3D Slicer as an image computing platform for the Quantitative Imaging Network. Magnetic Resonance Imaging, 30(9):1323–41, 2012.
[16] JJ Van Griethuysen, A Fedorov, C Parmar, A Hosny, N Aucoin, V Narayan, RG Beets-Tan, JC Fillion-Robin, S Pieper, and HJ Aerts. Computational radiomics system to decode the radiographic phenotype. Cancer Research, 77(21):e104–7, 2017.
[17] X Wang, ZY Jin, HD Xue, W Liu, H Sun, Y Chen, and K Xu. Evaluation of normal adrenal gland volume by 64-slice CT. Chinese Medical Sciences Journal, 27(4):220–4, 2012.
[18] L Jin, J Yang, K Kuang, B Ni, Y Gao, Y Sun, P Gao, W Ma, M Tan, H Kang, and J Chen. Deep-learning-assisted detection and segmentation of rib fractures from CT scans: Development and validation of FracNet. EBioMedicine, 62, 2020.
[19] DL Kendler, DC Bauer, KS Davison, L Dian, DA Hanley, ST Harris, MR McClung, PD Miller, JT Schousboe, CK Yuen, and EM Lewiecki. Vertebral fractures: clinical importance and management. The American Journal of Medicine, 129(2):221–e1, 2016.
[20] W Limthongkul, EE Karaikovic, JW Savage, and A Markovic. Volumetric analysis of thoracic and lumbar vertebral bodies. The Spine Journal, 10(2):153–8, 2010.
[21] DY Hancock, J Fischer, JM Lowe, W Snapp-Childs, M Pierce, S Marru, JE Coulter, M Vaughn, B Beck, N Merchant, and E Skidmore. Jetstream2: Accelerating cloud computing via Jetstream. In Practice and Experience in Advanced Research Computing, pages 1–8, 2021.
[22] TJ Boerner, S Deems, TR Furlani, SL Knuth, and J Towns. ACCESS: Advancing innovation: NSF’s advanced cyberinfrastructure coordination ecosystem: Services & support. In Practice and Experience in Advanced Research Computing, pages 173–176, 2023.

Supplementary Material

In this supplementary material, we describe the procedure for our analysis of the TotalSegmentator annotations of the NLST dataset. The following are our main contributions:

1. Dashboard: for exploring features extracted from annotations, and analyzing the effect of filters to detect failures in the segmentations: https://huggingface.co/spaces/ImagingDataCommons/CloudSegmentatorResults

2. Google Colaboratory notebook: for performing exploratory analysis of the consistency of segmentations and a comparison to measurements from the literature: https://github.com/ImagingDataCommons/CloudSegmentatorResults/blob/main/part2_exploratoryAnalysis.ipynb

3. Google Colaboratory notebook: for generating the files needed to power the dashboard (item 1) and the analysis (item 2): https://github.com/ImagingDataCommons/CloudSegmentatorResults/blob/main/part1_derivedDataGenerator.ipynb

We begin by describing the dashboard and the kinds of analysis it supports. We then describe the analysis that can be performed in the notebook and include some additional statistics. Lastly, we describe how we built the tables that power the dashboard and the preceding analysis.

We make the artifacts for this analysis available under the MIT license on GitHub and Hugging Face.

Appendix A Dashboard

To detect failures in the segmentations and analyze outliers, we developed an interactive Streamlit dashboard (https://github.com/streamlit/streamlit), hosted on the free tier of Hugging Face Spaces (https://huggingface.co/spaces/ImagingDataCommons/CloudSegmentatorResults). The dashboard has two main pages:

1. Summary page: This page shows the results of applying the four heuristics to each segmentation. These heuristics are the segmentation completeness check, the laterality check, the check that the number of connected components equals 1, and the check that the volume is greater than 5 mL. We display the percentage of series that passed each check, so the user can quickly investigate which segmentations are outliers.

2. Plots page: On this page, we display two types of plots that are created dynamically from user input through drop-down menus and sliders. The user can filter for specific regions of interest and select which heuristics passed or failed. The first plot is a violin plot showing the distributions of the standard deviations of a feature before and after applying the filters. These standard deviations are computed per patient, as each patient is scanned multiple times, and indicate how consistent the selected radiomics feature is across a patient's scans. The second set of plots are UpSet plots, which show the number of segments that passed or failed each combination of heuristics.
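The per-patient standard deviation behind the violin plot can be sketched as follows. This is a minimal illustration rather than the dashboard's code: the row layout, the column names `volume_ml` and `complete`, and the use of the sample standard deviation are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import stdev


def per_patient_std(rows, feature, keep=lambda row: True):
    """Return {patient_id: sample std of `feature`} over rows that pass `keep`.

    Patients with fewer than two remaining scans are skipped, because a
    sample standard deviation is undefined for a single value.
    """
    values = defaultdict(list)
    for row in rows:
        if keep(row):
            values[row["PatientID"]].append(row[feature])
    return {pid: stdev(v) for pid, v in values.items() if len(v) >= 2}


# Hypothetical rows: one segment of the same structure per scan.
rows = [
    {"PatientID": "P1", "volume_ml": 1500.0, "complete": True},
    {"PatientID": "P1", "volume_ml": 1520.0, "complete": True},
    {"PatientID": "P1", "volume_ml": 600.0, "complete": False},  # truncated scan
    {"PatientID": "P2", "volume_ml": 1800.0, "complete": True},
    {"PatientID": "P2", "volume_ml": 1790.0, "complete": True},
]

# Spread of the feature per patient before and after one filter.
before = per_patient_std(rows, "volume_ml")
after = per_patient_std(rows, "volume_ml", keep=lambda r: r["complete"])
```

In this toy data, removing the truncated scan narrows the spread for patient P1 and leaves P2 unchanged, which is the kind of shift the before/after violin plots are meant to expose.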

Appendix B Exploration of results using derived tables

The Colab notebook (https://github.com/ImagingDataCommons/CloudSegmentatorResults/blob/main/part2_exploratoryAnalysis.ipynb) contains the use cases that we studied in the manuscript. We include the generation of figures that we presented in the paper, starting with the parquet files generated from the first notebook (https://github.com/ImagingDataCommons/CloudSegmentatorResults/blob/main/part1_derivedDataGenerator.ipynb). If you want to skip the creation of the parquet files, you can jump straight to this notebook.

The goal of the manuscript was to explore and understand the effect of the heuristics on the ability to filter out problematic segmentations and identify failures. We focused on three main aspects, where our Colab notebook provides the code needed to reproduce figures in our manuscript for the first and third items:

1. Consistency of left vs right structures such as the ribs

2. Within-patient volumes of structures

3. Vertebral volumes compared to a population in the literature

We now provide further details on the first item above. For the analysis of left vs right structures, our goal was to study the consistency of the left vs right rib volumes after applying each successive heuristic. We computed the normalized difference between the left and right rib volumes using (left-right)/(left+right). We then applied the heuristics successively, where one filter = segmentation completeness, two filters = previous filter + connected components = 1, three filters = previous filters + volume > 5 mL, and four filters = previous filters + laterality is correct. We performed five linear mixed-effects modeling tests to assess the effect of each heuristic. We used the PatientID as a random effect to account for the fact that each patient had multiple scans.

The table below provides the p values and whether each test is significant ("is sig") for each pair of ribs. Applying the segmentation completeness check (original data vs one filter) produced a significant change in the normalized volume difference for 9 of the 12 rib pairs. Adding the connected components = 1 heuristic (one filter vs two filters) produced a significant change for 11 of the 12 pairs. In contrast, the 5 mL volume threshold was significant for only 3 of the 12 pairs, and the laterality check was not significant for any pair. The 12th rib often did not follow the same trend as the other ribs. Many of the 12th ribs were severely under-segmented yet still above the 5 mL threshold, and were therefore not flagged by the heuristics.
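The normalized difference and the successive filtering can be sketched as below. This is an illustrative sketch only: the flag names, the `seg` helper, and the rule that a rib pair is kept only when both sides pass are assumptions for this example, and the mixed-effects modeling itself is not shown.

```python
def normalized_difference(left, right):
    """(left - right) / (left + right); lies in [-1, 1] for positive volumes."""
    return (left - right) / (left + right)


# Order in which the heuristics are added, following the text.
FILTER_ORDER = ["complete", "one_component", "volume_ok", "laterality_ok"]


def passes_first_n(segment, n):
    """True if the segment passes the first n heuristics (n = 0 keeps everything)."""
    return all(segment[name] for name in FILTER_ORDER[:n])


def filtered_differences(pairs, n_filters):
    """Normalized differences for rib pairs where both sides pass the first n filters."""
    return [
        normalized_difference(p["left"]["volume_ml"], p["right"]["volume_ml"])
        for p in pairs
        if passes_first_n(p["left"], n_filters)
        and passes_first_n(p["right"], n_filters)
    ]


def seg(volume_ml, **overrides):
    """Hypothetical helper: a segment that passes every check unless overridden."""
    segment = {name: True for name in FILTER_ORDER}
    segment.update(overrides)
    segment["volume_ml"] = volume_ml
    return segment


pairs = [
    {"left": seg(10.0), "right": seg(10.0)},
    {"left": seg(12.0), "right": seg(4.0, complete=False)},  # right rib truncated
]

original = filtered_differences(pairs, 0)    # no filters applied
one_filter = filtered_differences(pairs, 1)  # segmentation completeness only
```

Each list of differences would then be one level of the comparison in a mixed-effects model with PatientID as the grouping factor.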

Rib	Original data vs one filter		One filter vs two filters		Two filters vs three filters		Three filters vs four filters		Original data vs four filters	
	p value	is sig	p value	is sig	p value	is sig	p value	is sig	p value	is sig
First rib	0.16	no	<<0.05	yes	<<0.05	yes	1.0	no	<<0.05	yes
Second rib	0.04	yes	<<0.05	yes	0.78	no	1.0	no	<<0.05	yes
Third rib	0.07	no	<<0.05	yes	0.96	no	1.0	no	<<0.05	yes
Fourth rib	<<0.05	yes	<<0.05	yes	0.87	no	0.99	no	<<0.05	yes
Fifth rib	<<0.05	yes	<<0.05	yes	1.0	no	1.0	no	<<0.05	yes
Sixth rib	<<0.05	yes	<<0.05	yes	0.99	no	1.0	no	<<0.05	yes
Seventh rib	<<0.05	yes	<<0.05	yes	0.74	no	1.0	no	<<0.05	yes
Eighth rib	<<0.05	yes	<<0.05	yes	0.70	no	1.0	no	<<0.05	yes
Ninth rib	<<0.05	yes	<<0.05	yes	0.35	no	1.0	no	<<0.05	yes
Tenth rib	<<0.05	yes	<<0.05	yes	0.28	no	1.0	no	<<0.05	yes
Eleventh rib	<<0.05	yes	<<0.05	yes	<<0.05	yes	1.0	no	<<0.05	yes
Twelfth rib	0.92	no	0.07	no	<<0.05	yes	1.0	no	<<0.05	yes

Appendix C Generation of base and derived tables

In this section, we describe the motivation for generating the base tables, followed by the derived tables used to evaluate the heuristics. The four heuristics we developed are based in part on metadata extracted from the DICOM Segmentation objects, which are publicly available as part of the Google Public Datasets program at https://console.cloud.google.com/marketplace/product/bigquery-public-data/nci-idc-data. However, for the analysis of segmentations we require the DICOM attribute PerFrameFunctionalGroupsSequence, as it contains crucial segmentation information such as the slices on which the segmentation is located, the segment number, and the segment label. This attribute was missing for the majority of the DICOM Segmentation objects because BigQuery has a limit of 1 MB per DICOM tag (https://cloud.google.com/healthcare-api/docs/how-tos/dicom-bigquery-streaming). Because we encoded all segmentations (up to 104) into a single DICOM Segmentation object, the PerFrameFunctionalGroupsSequence often exceeded the 1 MB limit. To work around this limitation, we extracted the attribute using pydicom in a workflow on Terra. Please see https://dockstore.org/myworkflows/github.com/ImagingDataCommons/CloudSegmentator/perFrameFunctionalGroupSequenceExtractionOnTerra. We unnested this otherwise nested attribute to create a flat table, exported it as a parquet file, and made it available as one of the base tables on GitHub at https://github.com/ImagingDataCommons/CloudSegmentatorResults/releases/download/0.0.1/nlst_totalseg_perframe.parquet

Secondly, we encoded only the shape and first-order radiomics features into the DICOM Structured Reports. We did not encode the pyradiomics general module features (https://pyradiomics.readthedocs.io/en/latest/radiomics.html#module-radiomics.generalinfo) because, to our knowledge, the DICOM standard did not have the necessary means to encode Center of Mass. We therefore saved them into a JSON file. When we realized that we could make use of the general features, we first extracted them into a BigQuery table, then exported them as parquet files and consolidated these into a single parquet file. We make this parquet file available as the second and last base table on GitHub at https://github.com/ImagingDataCommons/CloudSegmentatorResults/releases/download/0.0.1/json_radiomics.parquet.parquet.

We used the notebook (https://github.com/ImagingDataCommons/CloudSegmentatorResults/blob/main/part1_derivedDataGenerator.ipynb) to generate the following derived tables:
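The unnesting step can be illustrated with a small sketch. Plain Python dictionaries stand in here for the nested items that pydicom would return, so that the example runs without a DICOM file. The choice of output columns and the sample UID are assumptions for this example and do not necessarily match the schema of the released parquet file.

```python
def flatten_per_frame(series_uid, per_frame_items):
    """Turn nested per-frame functional group items into one flat row per frame."""
    rows = []
    for frame_number, item in enumerate(per_frame_items, start=1):
        # Each per-frame item nests single-item sequences; take their first entry.
        segment = item["SegmentIdentificationSequence"][0]
        position = item["PlanePositionSequence"][0]
        x, y, z = position["ImagePositionPatient"]
        rows.append(
            {
                "SeriesInstanceUID": series_uid,
                "FrameNumber": frame_number,
                "ReferencedSegmentNumber": segment["ReferencedSegmentNumber"],
                "ImagePositionPatient_x": x,
                "ImagePositionPatient_y": y,
                "ImagePositionPatient_z": z,
            }
        )
    return rows


# Hypothetical stand-in for a two-frame PerFrameFunctionalGroupsSequence.
per_frame = [
    {
        "SegmentIdentificationSequence": [{"ReferencedSegmentNumber": 1}],
        "PlanePositionSequence": [{"ImagePositionPatient": [-170.0, -180.0, -250.0]}],
    },
    {
        "SegmentIdentificationSequence": [{"ReferencedSegmentNumber": 2}],
        "PlanePositionSequence": [{"ImagePositionPatient": [-170.0, -180.0, -247.5]}],
    },
]

rows = flatten_per_frame("1.2.3", per_frame)  # "1.2.3" is a placeholder UID
```

A flat list of rows like this can be written directly to a columnar format, and it avoids the per-tag size limit because no single value holds the whole sequence.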

• bodyPartAndLaterality: This is an intermediate table that contains information about the body part segmented by TotalSegmentator, the segment number, the source CT series, and the laterality.

• Segmentation Completeness: This table records whether a segment had at least one slice below and one slice above the segmentation.

• Laterality: This table records whether laterality (left vs right) was correctly assigned by TotalSegmentator.

• Qualitative Checks: This table contains three heuristics: segmentation completeness, laterality, and connected components. The fourth heuristic is added when this table is merged with the quantitative measurements below.

• Flat Quantitative Measurements: This table contains the pivoted quantitative measurements for all TotalSegmentator segmentations. Each row represents a segment, and the 28 radiomics features are stored in separate columns.

• qualitative_checks_and_quant_measurements: This table combines all the heuristics with the 28 radiomics features for each segment and the general module feature VolumeNum, which gives the number of connected components. This file may be the most useful, as it powers the Hugging Face dashboard and all our analysis in the part 2 Colab notebook.

For all the above tables, we also included schema files in the same release (https://github.com/ImagingDataCommons/CloudSegmentatorResults/releases/tag/0.0.1).
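As a rough illustration of how the four heuristic flags in the combined table could be derived for one segment, consider the sketch below. The function names, the output keys, and the exact comparison of slice positions are assumptions for this example. The laterality flag is taken as given, since this section does not describe how it is computed.

```python
def is_complete(segment_z, series_z):
    """True if the source series has at least one slice below and one above the segment."""
    return min(series_z) < min(segment_z) and max(segment_z) < max(series_z)


def apply_heuristics(segment_z, series_z, laterality_ok, volume_num, volume_ml):
    """Return the four pass/fail flags for a single segment."""
    return {
        "segmentation_completeness": is_complete(segment_z, series_z),
        "laterality_check": laterality_ok,
        # VolumeNum from the pyradiomics general module counts connected components.
        "connected_components_check": volume_num == 1,
        "volume_check": volume_ml > 5.0,
    }


# Hypothetical slice positions (mm) of the source CT series.
series_z = [0.0, 2.5, 5.0, 7.5, 10.0]

# A segment in the middle of the scan, one component, 12 mL.
good = apply_heuristics([2.5, 5.0], series_z, True, 1, 12.0)

# A segment touching the first slice, split in two, and only 3 mL.
bad = apply_heuristics([0.0, 2.5], series_z, True, 2, 3.0)
```

Storing the flags as separate boolean columns, rather than a single pass/fail value, is what allows the filters to be applied one at a time or in any combination.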

C.1 Compute Environment

The part2_exploratoryAnalysis notebook was tested on a free Colab instance (2 vCPUs, 13 GB RAM) and takes about 2 hrs. We initially tested the part1_derivedDataGenerator notebook on a Jetstream2 instance with 32 vCPUs and 256 GB RAM. We have since made several optimizations that reduce RAM consumption at the cost of longer run times, and were able to run the notebook successfully even on a free-tier Hugging Face JupyterLab space with 2 vCPUs and 16 GB RAM. Run times improve with access to more computing resources. Other environments we tested include the free Colab instance (2 vCPUs, 13 GB RAM) and the Colab Pro High-RAM instance (8 vCPUs, 51 GB RAM). Run times range from 4 hrs to 10 hrs. No cloud credentials are necessary, as we query metadata that is freely available to the public in AWS buckets. We used DuckDB, an in-process database, as it can handle highly complex data with a small footprint.