NeuroSynth: MRI-Derived Neuroanatomical Generative Models and Associated Dataset of 18,000 Samples

Sai Spandana Chintapalli Center for AI and Data Science for Integrated Diagnostics (AI2D), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA Rongguang Wang Center for AI and Data Science for Integrated Diagnostics (AI2D), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA Zhijian Yang Center for AI and Data Science for Integrated Diagnostics (AI2D), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA Vasiliki Tassopoulou Center for AI and Data Science for Integrated Diagnostics (AI2D), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA Fanyang Yu Center for AI and Data Science for Integrated Diagnostics (AI2D), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA Vishnu Bashyam Center for AI and Data Science for Integrated Diagnostics (AI2D), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA Guray Erus Center for AI and Data Science for Integrated Diagnostics (AI2D), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA Department of Radiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA Pratik Chaudhari Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA 19104, USA Haochang Shou Center for AI and Data Science for Integrated Diagnostics (AI2D), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA Christos Davatzikos Center for AI and Data Science for Integrated Diagnostics (AI2D), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA Department of Radiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA Corresponding author: Christos Davatzikos (christos.davatzikos@pennmedicine.upenn.edu)

Abstract

Availability of large and diverse medical datasets is often challenged by privacy and data sharing restrictions. For successful application of machine learning techniques for disease diagnosis, prognosis, and precision medicine, large amounts of data are necessary for model building and optimization. To help overcome such limitations in the context of brain MRI, we present NeuroSynth: a collection of generative models of normative regional volumetric features derived from structural brain imaging. NeuroSynth models are trained on real brain imaging regional volumetric measures from the iSTAGING consortium, which encompasses over 40,000 MRI scans across 13 studies, incorporating covariates such as age, sex, and race. Leveraging NeuroSynth, we produce and offer 18,000 synthetic samples spanning the adult lifespan (ages 22-90 years), alongside the model’s capability to generate unlimited data. Experimental results indicate that samples generated from NeuroSynth agree with the distributions obtained from real data. Most importantly, the generated normative data significantly enhance the accuracy of downstream machine learning models on tasks such as disease classification. Data and models are available at: https://huggingface.co/spaces/rongguangw/neuro-synth.

Background & Summary

Empowered by widely available open datasets and challenges, in the past decade, machine learning algorithms have surpassed human-level performance on many non-trivial computer vision and natural language understanding tasks including face recognition, object detection, question answering, and translation [1, 2, 3, 4]. Translating the success of artificial intelligence tools from natural image or language applications to medical domain holds immense potential for advancing automated diagnosis and precision medicine. However, medical data usually contains sensitive patient information, which is subject to stringent privacy regulations from hospitals [5, 6]. Consequently, machine learning research in medicine is contained within individual institutes or hospitals, constrained by relatively small sample sizes due to resource limitations and resulting in datasets that lack diversity in terms of patient demographics and pathology representation. Models trained on such less-representative medical data are subject to sample selection bias and may lack generalizability across clinical cohorts [7, 8, 9, 10]. This underscores the need for collaborative efforts to access larger and more diverse medical datasets, ensuring the development of robust and trustworthy AI models in healthcare.

To address such shortages of data specifically in the context of neuroimaging, numerous international consortia have been established. Examples include ENIGMA [11] and iSTAGING [12], which primarily compile structural, functional imaging, and genetic data from tens of thousands of individuals. However, the usage of the data is still restricted to each consortium’s participants. To broaden accessibility and enhance research potential, generative models in machine learning have garnered attention. Techniques such as kernel-density estimation (KDE) [13] and generative adversarial networks (GANs) [14, 15, 16] have shown promises in synthesizing data that faithfully represents the underlying probability distributions of real-world data. This capability opens avenues for generating synthetic datasets that closely resemble actual patient data, thereby augmenting the pool available for analysis and model training. Recently, these models have been utilized to produce synthetic medical data and have proven valuable in many medical applications including chest X-ray screening and skin lesion detection [17, 18, 19, 20]. Consequently, these synthetic datasets can be openly shared and distributed without concerns regarding data privacy. Furthermore, having access to synthetic data that accurately represents a large population can significantly benefit downstream classification models, especially when working with a limited number of labeled examples. By leveraging synthetic data, we can bridge the gap between the available labeled samples and the diverse real-world scenarios, improving the robustness and generalization of our models.In many studies involving MRI (Magnetic Resonance Imaging), brain structure is commonly summarized by region-of-interest (ROI) volumes [21], which are derived from structural T1-weighted MRI scans. ROI volumes are robust brain features that have been validated in many applications including disease diagnosis and prognosis [22, 23, 24, 25, 26], progression modelling [27, 28, 29, 30, 31], and pathology subtype discovery [32, 33, 34, 35].

In this paper, we present NeuroSynth, a collection of generative models that generate normative ROI data over the adult lifespan (age range: 22 to 90 years) for different demographic groups categorized by race and sex. Furthermore, using these models, we offer a dataset of 18,000 synthetic neuroimaging samples representing a diverse global healthy adult population. Notably, we also provide the generative models alongside the dataset to enable users to customize their data synthesis. The dataset includes participants’ demographic information, such as sex, age and race, which can be beneficial for research focusing on mitigating algorithmic bias and promoting fairness. Our approach to generating synthetic data involved training generative models, specifically KDE models, on 34,000 subjects from the iSTAGING consortium [12] to synthesize both brain anatomical ROI volumes and their associated demographics. In the experiments, we assess the quality of the synthetic data from both qualitative and quantitative perspectives. For example, we create plots highlighting the similarity between the real and synthetic data distributions across all ages groups. In terms of quantitative evaluation, we use statistical tests and machine learning models to show the similarity in distributions between synthetic and real data. We also examine the fidelity of the synthetic data by performing covariates prediction including sex or race classification and age regression tasks. Finally, we show the utility of the synthetic data in real world applications such as brain age gap prediction [36].

NeuroSynth, a comprehensive structural brain imaging generator and dataset, aims to significantly contribute to advancing machine learning research within the field of neuroimaging. In particular, NeuroSynth can facilitate: 1) Local Disease Population Comparisons: Researchers can leverage NeuroSynth to compare their own patient populations to NeuroSynth’s normative dataset. To ensure robust inference that is invariant to inter-site differences, researchers should first use part of their control data to harmonize their measures with those of NeuroSynth. 2) Brain Age Prediction Models: NeuroSynth serves as a resource for training brain age prediction models. These models have been used to detect the effects of neurodegenerative and neuropsychiatric disorders on the brain. In the technical validation section, experiments using NeuroSynth have explored the relationship between brain age residuals and cognitive scores.Beyond brain age prediction, the synthetic data generated by NeuroSynth can also be leveraged to train other machine learning models and adapt them for smaller-scale studies through techniques like transfer learning or domain adaptation. 3) Enriching Healthy Controls in Classification Studies: The dataset enhances the healthy control class for discriminative analysis and disease subtyping research. For instance, in an Alzheimer’s disease diagnosis experiment, NeuroSynth has been utilized for data augmentation, leading to more robust results. 4) Synthetic Data Generator Models: Beyond providing the NeuroSynth dataset containing 18,000 samples, researchers also have access to the synthetic data generator models. These pre-trained models allow customization of brain volume synthesis based on factors such as sex, age range, and race groups.

Our goal is to continue expanding NeuroSynth with additional covariates, including genetic risk factors, cognitive scores, and biomarker data. This expansion will enable users to enrich their own datasets by synthesizing highly specific brain ROI measures tailored to their individual studies.

Table 1: Synthetic data table comprising of brain ROI volumes along with demographic information. Sex variable is coded as F for female and M for male.

SampleID	Sex	Race	Age	3rd Ventricle	4th Ventricle	Right Accumbens Area	…
Synth1	F	Black	83.49359824	572.7620447	1298.91257	438.5023773	…
Synth2	M	White	62.13155172	1204.073174	1149.880788	246.0560264	…
.	.	.	.	.	.	.	.
.	.	.	.	.	.	.	.
.	.	.	.	.	.	.	.

Methods

In the following subsections we describe i) the real dataset used to construct the NeuroSynth generative model, and ii) the approach used for training the generative model for synthetic data generation.

Real data

For our real data, we use the iSTAGING consortium [12] that consolidated and harmonized imaging and clinical data from multiple cohorts spanning a wide age range (22 to 90 years). Our data consists of multimodal neuroimaging and demographic measures taken from subjects labeled as cognitively normal in the iSTAGING consortium. Specifically, the neuroimaging measures are the 145 anatomical brain ROI volumes (119 ROIs in gray matter, 20 ROIs in white matter and 6 ROIs in ventricles) from baseline scans extracted using a multi-atlas label fusion method [21]. To mitigate site effects, ComBat-GAM harmonization [37] was applied to these 145 ROI volumes while accounting for age, sex and intracranial volume (ICV). The demographic measures including subjects’ age, sex, and race were accounted in the synthesis. Employing a stratified approach, subjects in the real data were grouped based on race and sex covariates, resulting in six subsets or categories: white male, white female, black male, black female, asian male, and asian female. Before fitting the generative model, within each category, the feature vector of 145 ROI volumes along with the age variable is normalized using the mean and the standard deviation. Mean and standard deviation values for each category were retained to facilitate the back transformation of synthesized data from normalized space to the original space.

Generative model training

We employ a separate non-parametric kernel density estimation (KDE) model for each category, using a Gaussian kernel to delineate the joint probability density of age and the 145 ROI volumes. The selection of the Gaussian kernel is crucial; its smooth, bell-shaped profile is well-suited for generating a continuous representation of the underlying distribution from discrete observations. Within this framework, each data point is represented by a local Gaussian density surface centered at the observed location. Collectively, the overall density is smoothly estimated using composite functions of these local densities, based on the discrete empirical data. This approach facilitates the estimation of the multivariate joint probability density function for age and ROI volumes within each distinct category. Generating new synthetic data involves a two-step process: model fitting and sampling. Initially, a multivariate KDE model is fitted to the real data of each sex and race combination with carefully tuned bandwidth parameters. We then generate new data points by sampling from this refined distribution. Each synthesized vector is of size 146; the first 145 elements correspond to the ROI volumes, and the last element represents age. The sampling process introduces randomness, ensuring that while the synthetic data are not exact replicas of the original, they adhere to the same statistical properties.

Hyper-parameter selection

The bandwidth parameter determines the width of the Gaussian kernel and plays a crucial role in KDE. A smaller bandwidth results in a more granular estimate with greater detail, reflecting minor variations in the data distribution, whereas a larger bandwidth leads to a smoother and more generalizable estimate. By carefully selecting the bandwidth, the KDE model can effectively navigate the trade-off between overfitting, where the model captures excessive noise, and underfitting, where important features of the data distribution are overlooked. To identify the optimal bandwidth for the kernel in Kernel Density Estimation (KDE), we conducted a grid search spanning bandwidths from 0.5 to 1. The selection criterion for the optimal bandwidth was the maximization of the log-likelihood score across these values. For the majority of the categories analyzed, the optimal bandwidth converged at approximately 0.7. The search space of the bandwidth is constrained to 0.5 to 1 because our dataset is normalized ( $\sim N(0,1)$ ). A small bandwidth will overfit the training data, since it is going to fit a narrow Gaussian kernel around any point; whereas a large bandwidth, close to 1, will result in an overly smoothed density estimate that fails to maintain the correlation between age and the 145 ROIs. The model implementation and the bandwidth grid-search are carried out using the scikit-learn library [38].

Data Records

The NeuroSynth dataset and model are hosted at https://huggingface.co/spaces/rongguangw/neuro-synth. The "neurosynth_dataset.csv" file contains 18,000 synthetic samples with 3000 samples allocated per combination of race and sex, a deliberate sampling strategy aimed at ensuring thorough representation across demographic groups. Each sample comprises of 148 features (145 brain ROI volumes along with age, sex, and race information). See Table. 1 for an example of the synthetic dataset. We enhance the utility of our dataset by providing users with access to both the synthetic data and the fitted Kernel Density Estimation (KDE) models. This provision equips researchers with the ability to generate synthetic data that aligns precisely with their research requirements.

Technical Validation

Refer to caption — Figure 1: Comparing synthetic and real data distributions. (a) Qualitative assessment of real versus synthetic ROI volume data distributions in the 3rd ventricle and right hippocampus. (b) Results from univariate statistical analysis to check for differences in synthetic and real ROI volumes. (c) 5-fold cross-validated AUC scores for binary classification of real and synthetic data using support vector machine regression indicate that the generated dataset is indistinguishable from the real dataset.

We conduct a comprehensive validation of the synthetic data through two levels of analysis. In the first level, we employ a combination of statistical and machine learning techniques to ensure that the synthetically generated brain ROI volume data closely mirrors the distribution observed in real ROI volume data across demographic variables such as age, sex, and race. Subsequently, at the second level, we aim to validate the practical use of synthetic data in real-world scientific applications, such as brain age gap estimation and data augmentation. Through these rigorous analyses, we aim to ascertain the accuracy, reliability, and utility of synthetic data in both replicating real data distributions and facilitating meaningful applications in scientific research and clinical practice.

Assessing Fidelity: Statistical and Machine Learning Analysis Comparing Synthetic and Real Data Distributions

In this section, we examine synthetic samples generated from the KDE models in comparison with a held-out dataset containing real samples from the iSTAGING consortium [12]. Given that the synthetic data spans the entire adult lifespan and covers all combinations of race and sex, Fig.1a serves to visually illustrate the similarity in distributions between real and synthetic data across several ROIs. From Fig. 1a it is evident that for these two representative examples, i.e. 3rd ventricle and right hippocampus, there is significant overlap between the generated data samples and the real data samples. Furthermore, the age trend observed in the synthetic data roughly mirrors that observed in the real data. To further evaluate the fidelity of the ROI volume distributions, univariate statistical analysis was conducted via linear regression of each ROI volume on group (real versus synthetic) while adjusting for age as a covariate [39]. Bonferroni correction [40] was applied to adjust for multiple comparisons, resulting in a significance threshold of $\alpha=\frac{0.05}{145}=0.0003$ . Fig. 1b presents the count of ROIs, out of the total 146 ROIs, with statistically significant differences in group means for every combination of race and sex.

Additionally, to assess the quality of the multivariate ROI volume distributions, we use the support vector machine (SVM) [41] classifier that is trained to differentiate between synthetic and real samples. An area under the receiver operating characteristic curve (AUC) [42] of 0.5 would suggest that the classifier is making random predictions and cannot distinguish between the two groups. Therefore, if the SVM fails to differentiate synthetic samples from real samples, we can infer that the generated dataset is indistinguishable from the real dataset. Fig. 1c summarizes the findings across different race and sex categories. While there remains some classification capability (AUC > 0.5 in white males, white and black females), suggesting slight differences between the two distributions, subsequent analyses under practical applicability section demonstrate that this discrepancy does not confound or significantly impact the utility of the synthetic dataset.

Since the data for each combination of race and sex was generated from different KDE models, it is critical to assess whether the relationship between ROI volumes and covariates (age, sex, race) in the synthetic data aligns with observations from real data. To carry out this analysis, we build machine learning models for covariate prediction, specifically we use gradient boosted trees (implemented using XGBoost library [43]). These models are trained to predict covariates based on ROI volume data, performing tasks such as age regression, sex classification, and race classification. By comparing the performance of the models trained on synthetic data with those trained on the real data, we can determine if synthetic data is a suitable substitute for real data. Sex and race classification performance is evaluated using metrics such as accuracy, balanced accuracy, and AUC. Age regression performance is evaluated using mean absolute error (MAE) and Pearson’s correlation [44] between predicted age and ground truth age. Fig.2 shows the results for covariate prediction, we used 5-fold cross-validation during analysis. While models trained on real data generally outperform those trained on synthetic data, it is noteworthy that the performance of models trained on synthetic data is comparable to their real data counterparts. These results suggest that synthetic data can serve as a valuable alternative to real data, when the latter is not available. The next section delves into the practical applicability of synthetic data, exploring its potential benefits and limitations in real-world scenarios.

Practical Applicability

In addition to validating the quality of the generated data, we conducted comprehensive assessments to showcase the practical applications of synthetic data across various scenarios. This included its efficacy in augmenting training datasets for disease classification and deriving clinically meaningful estimates of brain age gaps. In this section, we utilized the ADNI study as a held-out dataset to assess the aforementioned properties. The synthetic data was derived from a KDE model retrained on the remaining studies, following the steps outlined in Methods.

Data augmentation

Deep learning methods usually require a large number of training samples, which are laborious and costly to obtain, especially for brain MRI studies. Moreover, datasets focusing on specific diseases may sometimes lack a sufficient number of healthy controls. To address this, we investigated the feasibility of using synthetic data to supplement the normal control (CN) group. We centered our analysis on the mild cognitive impairment (MCI) or Alzheimer’s disease (AD) classifications. Leveraging data from the ADNI study [45], we evaluated whether synthetic data based data augmentation can help improve classification performance. We divided the ADNI dataset, allocating 500 cognitively normal participants for training, 368 for testing, and evenly dividing 1101 MCI and 419 AD participants between training and testing sets. To ensure the robustness of our findings, we conducted fifty distinct random splits. For both CN vs MCI and CN vs AD classifications, we trained the SVM algorithm on the training set and assessed its performance on the test set using the AUC metric. Notably, the training sets comprised different proportions of CN data from both real and synthetic datasets, allowing us to evaluate the specific contribution of synthetic data to performance enhancement (Fig. 3).
Figure 3 illustrates that a reduced number of training samples from the normal control group (500 v.s. 100) results in lower MCI and AD classification performances, underscoring the importance of augmenting the healthy control dataset. Supplementation of the CN set with synthetic normal control data progressively improved classification AUCs, although the rate of improvement slowed as the proportion of synthetic data within the CN set increased.

Brain age gap estimation

Brain age gap[46, 36, 47, 48, 49], the difference between predicted brain age and actual chronological age, indicates deviations from normal brain aging and proves important for assessing neurological health. Utilizing large-scale synthetic control data can potentially enhance the development of age-prediction models, offering more reliable and clinically relevant brain age gap estimations. We trained XGBoost regression models on both synthetic and real control data, and used the trained models for calculating brain age gaps for CN and MCI/AD participants in the held-out ADNI dataset. We will refer to brain age gaps derived using the synthetic data model as OOD brain age gaps and refer to brain age gaps derived using the real data model as IND brain age gaps. To mitigate biases inherent in predicted brain age estimates, we applied Cole’s method [50] before calculating the brain age gaps. Further, to examine their effectiveness in indicating cognitive decline and underlying neuropathology, we examined their correlations with Mini Mental State Examination (MMSE) scores — a widely utilized cognitive assessment for measuring cognitive impairment.

As shown in Fig. 4, both OOD and IND brain age gaps show no significant correlations with MMSE among CN participants. However, among MCI/AD participants, they exhibit significant correlations ( $p<0.0001$ ). Interestingly, the Pearson’s correlation coefficient for OOD brain age gaps ( $\rho=-0.235$ ) is similar to that observed for IND brain age gaps( $\rho=-0.304$ ). This further underscores the potential of regression models trained on extensive synthetic datasets to provide brain age gap estimation with increased clinical significance.

Usage Notes

By employing rigorous statistical and machine learning analyses, we have demonstrated that the synthetic data generated by NeuroSynth closely aligns with real data distributions across various demographic variables, including age, sex, and race. Our evaluations, detailed in the technical validation section, indicate that the multivariate distributions learned by the model accurately preserve covariate effects, ensuring that the synthetic data maintains the integrity of the original data’s demographic characteristics. The practical utility of NeuroSynth has been illustrated through various applications, showing its potential to serve as either a substitute or complement to real data. In scenarios with class imbalances in real datasets, we have shown how researchers can expand their sample sizes with synthetic data and improve the performance of machine learning models. We have also show the utility of NeuroSynth in brain age predictions. As it is designed to emulate a reference dataset representing a global healthy population across the human lifespan, researchers can leverage NeuroSynth to compare their small cohorts to the reference population, gaining insights into age-related changes and deviations from expected brain development trajectories, thus enhancing our understanding of neurodevelopmental processes and age-related neurodegeneration.
Moreover, synthetic data holds significant potential for harmonization efforts in integrating and comparing datasets from diverse sources. Synthetic data can be specifically tailored to generate covariate-matched control data, ensuring that external studies can be effectively harmonized with existing datasets. This approach enables researchers to create control groups with similar demographic characteristics as that of their study population, thereby reducing confounding effects and then utilizing a harmonization technique [51] to align their external data to the synthetic matched control data. This enhances the robustness of comparative analyses.
While we have demonstrated the potential uses of synthetic data, there are certain limitations that end-users should consider when using NeuroSynth. The quality of synthetic samples is influenced by the selection of the bandwidth parameter in KDE, and although the synthetic data statistically resembles real data, it may contain noise due to kernel smoothing. This noise might affect the results of any analysis performed with NeuroSynth, so users should interpret their results with caution. Additionally, when integrating their own data with NeuroSynth, users should be aware of potential site-related differences that might impact their analyses. There are techniques available that can help the user harmonize their data to NeuroSynth[52] to mitigate site-related effects and facilitate more robust analyses.
In conclusion, NeuroSynth represents an advancement in the generation and application of synthetic neuroimaging data, providing a valuable resource for the neuroimaging community. By continuing to expand NeuroSynth with additional covariates, including genetic risk factors, cognitive scores, and biomarker data, we aim to further enhance its utility and applicability in diverse research contexts. This work underscores the importance of synthetic data in advancing neuroimaging research, promoting data accessibility, and ensuring the development of robust, generalizable machine learning models in healthcare.

Code availability

All our model building and analysis were carried out in python. All our data transformations are described in detail in the methods section. Statistical analyses were conducted via online python packages, statsmodels 0.8.0, SciPy 1.6.3, NumPy 1.16.6 and pandas 0.21.0. Our machine learning experiments were conducted with version 1.1.3 of the Scikit-learn library¹¹1https://scikit-learn.org. The trained KDE models are available at https://huggingface.co/spaces/rongguangw/neuro-synth. Sample code for model training and data generation is available at https://huggingface.co/spaces/rongguangw/neuro-synth/blob/main/script/synthetic_data_generation.ipynb.

References

[1] LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. \JournalTitlenature 521, 436–444 (2015).
[2] He, K., Gkioxari, G., Dollár, P. & Girshick, R. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, 2961–2969 (2017).
[3] Vaswani, A. et al. Attention is all you need. \JournalTitleAdvances in neural information processing systems 30 (2017).
[4] Brown, T. et al. Language models are few-shot learners. \JournalTitleAdvances in neural information processing systems 33, 1877–1901 (2020).
[5] Chen, R. J., Lu, M. Y., Chen, T. Y., Williamson, D. F. & Mahmood, F. Synthetic data in machine learning for medicine and healthcare. \JournalTitleNature Biomedical Engineering 5, 493–497 (2021).
[6] Rajotte, J.-F. et al. Synthetic data as an enabler for machine learning applications in medicine. \JournalTitleIscience 25 (2022).
[7] Benkarim, O. et al. Population heterogeneity in clinical cohorts affects the predictive accuracy of brain imaging. \JournalTitlePLoS biology 20, e3001627 (2022).
[8] Wang, R., Chaudhari, P. & Davatzikos, C. Embracing the disharmony in medical imaging: A simple and effective framework for domain adaptation. \JournalTitleMedical image analysis 76, 102309 (2022).
[9] Kopal, J., Uddin, L. Q. & Bzdok, D. The end game: respecting major sources of population diversity. \JournalTitleNature Methods 20, 1122–1128 (2023).
[10] Wang, R., Chaudhari, P. & Davatzikos, C. Bias in machine learning models can be significantly mitigated by careful training: Evidence from neuroimaging studies. \JournalTitleProceedings of the National Academy of Sciences 120, e2211613120 (2023).
[11] Thompson, P. M. et al. The enigma consortium: large-scale collaborative analyses of neuroimaging and genetic data. \JournalTitleBrain imaging and behavior 8, 153–182 (2014).
[12] Habes, M. et al. The brain chart of aging: machine-learning analytics reveals links between brain aging, white matter disease, amyloid burden, and cognition in the istaging consortium of 10,216 harmonized mr scans. \JournalTitleAlzheimer’s & Dementia 17, 89–102 (2021).
[13] Chen, Y.-C. A tutorial on kernel density estimation and recent advances. \JournalTitleBiostatistics & Epidemiology 1, 161–187 (2017).
[14] Goodfellow, I. et al. Generative adversarial nets. \JournalTitleAdvances in neural information processing systems 27 (2014).
[15] Yi, X., Walia, E. & Babyn, P. Generative adversarial network in medical imaging: A review. \JournalTitleMedical image analysis 58, 101552 (2019).
[16] Wang, R. et al. Applications of generative adversarial networks in neuroimaging and clinical neuroscience. \JournalTitleNeuroimage 119898 (2023).
[17] Yu, B., Wang, Y., Wang, L., Shen, D. & Zhou, L. Medical image synthesis via deep learning. \JournalTitleDeep Learning in Medical Image Analysis: Challenges and Applications 23–44 (2020).
[18] DuMont Schütte, A. et al. Overcoming barriers to data sharing with medical image generation: a comprehensive evaluation. \JournalTitleNPJ digital medicine 4, 141 (2021).
[19] Dalmaz, O., Yurt, M. & Çukur, T. Resvit: residual vision transformers for multimodal medical image synthesis. \JournalTitleIEEE Transactions on Medical Imaging 41, 2598–2614 (2022).
[20] Giuffrè, M. & Shung, D. L. Harnessing the power of synthetic data in healthcare: innovation, application, and privacy. \JournalTitleNPJ Digital Medicine 6, 186 (2023).
[21] Doshi, J. et al. Muse: Multi-atlas region segmentation utilizing ensembles of registration algorithms and parameters, and locally optimal atlas selection. \JournalTitleNeuroimage 127, 186–195 (2016).
[22] Davatzikos, C., Bhatt, P., Shaw, L. M., Batmanghelich, K. N. & Trojanowski, J. Q. Prediction of mci to ad conversion, via mri, csf biomarkers, and pattern classification. \JournalTitleNeurobiology of aging 32, 2322–e19 (2011).
[23] Rathore, S., Habes, M., Iftikhar, M. A., Shacklett, A. & Davatzikos, C. A review on neuroimaging-based classification studies and associated feature extraction methods for alzheimer’s disease and its prodromal stages. \JournalTitleNeuroImage 155, 530–548 (2017).
[24] Samper-González, J. et al. Reproducible evaluation of classification methods in alzheimer’s disease: Framework and application to mri and pet data. \JournalTitleNeuroImage 183, 504–521 (2018).
[25] Wen, J. et al. Convolutional neural networks for classification of alzheimer’s disease: Overview and reproducible evaluation. \JournalTitleMedical image analysis 63, 101694 (2020).
[26] Wang, R., Erus, G., Chaudhari, P. & Davatzikos, C. Adapting machine learning diagnostic models to new populations using a small amount of data: Results from clinical neuroscience. \JournalTitlearXiv preprint arXiv:2308.03175 (2023).
[27] Marinescu, R. V. et al. The alzheimer’s disease prediction of longitudinal evolution (tadpole) challenge: Results after 1 year follow-up. \JournalTitleMachine Learning for Biomedical Imaging 1, 1–60 (2021).
[28] Tassopoulou, V., Yu, F. & Davatzikos, C. Deep kernel learning with temporal gaussian processes for clinical variable prediction in alzheimer’s disease. In Machine Learning for Health, 539–551 (PMLR, 2022).
[29] Wijeratne, P. A. et al. The temporal event-based model: Learning event timelines in progressive diseases. \JournalTitleImaging Neuroscience 1, 1–19 (2023).
[30] Aksman, L. M. et al. A data-driven study of alzheimer’s disease related amyloid and tau pathology progression. \JournalTitleBrain 146, 4935–4948 (2023).
[31] Young, A. L. et al. Data-driven modelling of neurodegenerative disease progression: thinking outside the black box. \JournalTitleNature Reviews Neuroscience 1–20 (2024).
[32] Young, A. L. et al. Uncovering the heterogeneity and temporal complexity of neurodegenerative diseases with subtype and stage inference. \JournalTitleNature communications 9, 4273 (2018).
[33] Yang, Z. et al. A deep learning framework identifies dimensional representations of alzheimer’s disease from brain structure. \JournalTitleNature communications 12, 7065 (2021).
[34] Vogel, J. W. et al. Four distinct trajectories of tau deposition identified in alzheimer’s disease. \JournalTitleNature medicine 27, 871–881 (2021).
[35] Yang, Z. et al. Gene-sgan: discovering disease subtypes with imaging and genetic signatures via multi-view weakly-supervised deep clustering. \JournalTitleNature Communications 15, 354 (2024).
[36] Franke, K. & Gaser, C. Ten years of brainage as a neuroimaging biomarker of brain aging: what insights have we gained? \JournalTitleFrontiers in neurology 789 (2019).
[37] Pomponio, R. et al. Harmonization of large mri datasets for the analysis of brain imaging patterns throughout the lifespan. \JournalTitleNeuroImage 208, 116450 (2020).
[38] Pedregosa, F. et al. Scikit-learn: Machine learning in Python. \JournalTitleJournal of Machine Learning Research 12, 2825–2830 (2011).
[39] Dismuke, C. & Lindrooth, R. Ordinary least squares. \JournalTitleMethods and designs for outcomes research 93, 93–104 (2006).
[40] Bland, J. M. & Altman, D. G. Multiple significance tests: the bonferroni method. \JournalTitleBmj 310, 170 (1995).
[41] Hearst, M. A., Dumais, S. T., Osuna, E., Platt, J. & Scholkopf, B. Support vector machines. \JournalTitleIEEE Intelligent Systems and their applications 13, 18–28 (1998).
[42] Pepe, M. S., Cai, T. & Longton, G. Combining predictors for classification using the area under the receiver operating characteristic curve. \JournalTitleBiometrics 62, 221–229 (2006).
[43] Chen, T. & Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 785–794 (2016).
[44] Sedgwick, P. Pearson’s correlation coefficient. \JournalTitleBmj 345 (2012).
[45] Jack Jr, C. R. et al. The alzheimer’s disease neuroimaging initiative (adni): Mri methods. \JournalTitleJournal of Magnetic Resonance Imaging: An Official Journal of the International Society for Magnetic Resonance in Medicine 27, 685–691 (2008).
[46] Smith, S. M., Vidaurre, D., Alfaro-Almagro, F., Nichols, T. E. & Miller, K. L. Estimation of brain age delta from brain imaging. \JournalTitleNeuroimage 200, 528–539 (2019).
[47] Jónsson, B. A. et al. Brain age prediction using deep learning uncovers associated sequence variants. \JournalTitleNature communications 10, 5409 (2019).
[48] Leonardsen, E. H. et al. Deep neural networks learn general and clinically relevant representations of the ageing brain. \JournalTitleNeuroImage 256, 119210 (2022).
[49] Zhou, Z. et al. Multiscale functional connectivity patterns of the aging brain learned from harmonized rsfmri data of the multi-cohort istaging study. \JournalTitleNeuroimage 269, 119911 (2023).
[50] Zhang, B., Zhang, S., Feng, J. & Zhang, S. Age-level bias correction in brain age prediction. \JournalTitleNeuroImage: Clinical 37, 103319 (2023).
[51] Horng, H. et al. Improved generalized combat methods for harmonization of radiomic features. \JournalTitleScientific Reports 12, 19009 (2022).
[52] Hu, F. et al. Image harmonization: A review of statistical and deep learning methods for removing batch effects and evaluation metrics for effective harmonization. \JournalTitleNeuroImage 120125 (2023).

Acknowledgements

The iSTAGING consortium is a multi-institutional effort funded by the National Institute on Aging with grant number RF1AG054409. Data used in preparation of this article were in part obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: https://adni.loni.usc.edu/wpcontent/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf. ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.

Author contributions statement

Conceptualization, Methodology, Formal Analysis, Investigation & Writing: Sai Spandana Chintapalli, Rongguang Wang, Zhijian Yang, and Vasiliki Tassopoulou; Conceptualization, Methodology & Formal Analysis: Fanyang Yu and Vishnu Bashyam; Conceptualization: Guray Erus; Conceptualization & Review: Pratik Chaudhari; Review, Editing & Supervision: Haochang Shou and Christos Davatzikos.

Competing interests

The authors declare no competing interests.