1 Introduction

Fluid intelligence is a crucial component of human general intelligence; it involves the capacity to think logically and to solve problems in novel situations, independent of acquired knowledge [14]. It is widely accepted that fluid intelligence reaches a peak in late adolescence and then declines [13]. Its quantification and accurate prediction are important for teenagers, since fluid intelligence foreshadows creative achievement, scholastic performance, employment prospects, socioeconomic status, etc., in later years [8]. Structural magnetic resonance (MR) imaging is one of the most powerful tools for predicting fluid intelligence. The ABCD challenge dataset provides structural MR images for a large number of adolescent participants, with the aim of precisely predicting the pre-residualized fluid intelligence scores (RFIS), which have been adjusted for data collection site, demographic variables and whole brain volume. It includes 3,739 training subjects, 415 validation subjects and 4,402 test subjects. For each subject, the cortical volumes of 122 regions of interest were extracted from T1 images by the challenge organizers according to the NCANDA pipeline [20].

Previous studies of Alzheimer's disease found that regional brain volumes are closely related to the cognitive status of individuals [2]. While most of the previous literature focuses on gray matter cortical metrics [3, 23], recent studies have revealed associations of white matter and subcortical volumes with cognitive functions [11, 25]. Cortical thickness can also be related to cognition [7, 15]. Based on those findings, we pre-processed the raw T1 MR images with the FreeSurfer software [21] to extract white matter volumes, subcortical regional volumes, etc., and thereby obtain richer explanatory features in different regions of interest (ROIs) to improve prediction performance. Our challenge results showed that, besides the cortical volumes, structural metrics such as white matter volume, subcortical volume and mean curvature were also useful.

Information extraction and information manipulation are two important components of prediction. For information extraction, principal component analysis (PCA) is a standard technique in multivariate statistics that uses a set of linear approximations for dimension reduction. However, the classical PCA procedure has a side effect: the principal components mainly focus on subdomains with large variance. When small groups of highly correlated covariates exist, important features might remain hidden [16]. The human brain encompasses complex network structures, and different brain regions can be highly correlated. Cluster analysis is a typical method for grouping a set of objects into subsets according to their “similarities”. Previous literature has shown that hierarchical clustering results can correspond to brain anatomical configurations [17]. By dividing similar covariates into groups and extracting principal components (PCs) within each cluster, we can preserve important features. For information manipulation, statistical modeling and machine learning (ML) methods, such as linear and ridge regression [12], random forest [5], support vector machines [9], etc., have remained popular for decades. Among those, Extreme Gradient Boosting (XGBoost) [6] is extensively used by ML practitioners to create state-of-the-art data science solutions, and has recently gained much attention as the choice of many winning teams in ML competitions [24].

We propose to use hierarchical clustering with block PCA to extract important features, which are then fed into XGBoost to predict fluid intelligence. Our results show that incorporating block PCA into the XGBoost framework leads to better prediction performance than using XGBoost with either the original covariates or traditionally calculated principal components.

2 Method

2.1 Dataset

The ABCD challenge uses the NIH Toolbox® [19], the Rey Auditory Verbal Learning Test (RAVLT) [10], the Little Man Task [1], etc., to quantify the fluid intelligence scores [22]. The whole dataset we obtained includes 4,459 males and 4,085 females aged from 8 to 12 years old (107–133 months), and 12 additional individuals with missing demographic information. A detailed distribution of ages and genders for each of the training, validation and test sets can be found in Fig. 1. Raw T1-weighted MR images were acquired under multiple protocols with Siemens, Philips and GE scanners, and were further processed according to [20]. The cortical ROI volumes for each subject were calculated. We describe our workflow on this dataset in the following subsections and illustrate it in Fig. 4(a).

Fig. 1. Age and gender distribution (left) and number of subjects among training, validation and test datasets (right).

Fig. 2. Manhattan plots: \(-\log _{10}\) P-value (left), \(-\log _{10}\) FDR-adjusted P-value (middle) and squared Pearson correlation \(R^2\) (right) between the response RFIS and the cortical volumes, two demographic covariates (age and gender), mean curvatures and subcortical volumes.

Fig. 3. Clustering analysis and PCA based on all the regional cortical volumes.

Fig. 4. Workflow (a) and the important predictors (b).

2.2 The Preprocessing

To predict the RFIS of subjects in the validation and test sets, we first pre-processed the raw T1 MR images of all participants in the whole dataset to extract their white matter and subcortical ROI volumes, mean curvatures, etc., using FreeSurfer's standard recon-all pipeline (v.6.0.0) [24], which includes motion correction, intensity normalization, skull stripping, removal of non-brain tissue, brain mask generation, cortical reconstruction, WM and subcortical segmentation, and cortical parcellation. The white matter volume, subcortical volume and mean curvature of each ROI and individual were extracted as supplemental information and combined with the cortical volumes, age and gender described in Subsect. 2.1 to make predictions. Pearson's correlations between each structural metric and the RFIS, together with their p-values, were calculated from all subjects in the training dataset and are displayed in Fig. 2.

We found that age and gender were not significant (p-value > 0.80). There were 22 features with False Discovery Rate (FDR) adjusted p-values [4] smaller than 0.05, which include: white matter volume of the pons and left entorhinal region; gray matter volume of the left and right parahippocampal gyrus; subcortical volume of the right and left globus pallidus and right ventral diencephalon; mean cortical curvature of the right and left precentral gyrus, right postcentral gyrus, right paracentral lobule, right and left superior parietal lobule, right Banks of Superior Temporal Sulcus, right superior temporal gyrus, right medial orbital gyrus and right inferior temporal gyrus; and number of defect holes in the right and total surfaces. The top 3 features were white matter volume of the pons, gray matter of the right precentral gyrus and white matter volume of the left entorhinal region, whose FDR-adjusted p-values were less than 0.01.
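The screening statistics described above can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes that the FDR adjustment is the Benjamini-Hochberg procedure, and the function names `pearson_pvalues` and `bh_adjust` and the arrays `X` (subjects by features) and `y` (RFIS) are hypothetical.

```python
import numpy as np
from scipy import stats


def pearson_pvalues(X, y):
    """Pearson correlation and p-value of each column of X with y."""
    n_features = X.shape[1]
    r = np.empty(n_features)
    p = np.empty(n_features)
    for j in range(n_features):
        r[j], p[j] = stats.pearsonr(X[:, j], y)
    return r, p


def bh_adjust(p):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, starting from the largest p-value.
    adj_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adj = np.empty(m)
    adj[order] = np.minimum(adj_sorted, 1.0)
    return adj
```

Features whose adjusted p-value falls below 0.05 would then be reported as significant, and \(R^2\) in Fig. 2 corresponds to `r ** 2`.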

2.3 Clustering and Block PCA

To extract information efficiently from the original datasets, we first performed hierarchical clustering using the Ward.D2 algorithm [18] on each of the four structural metrics separately: cortical volumes, WM volumes, subcortical volumes and mean curvatures. Inspecting the clustering results, we found that cutting the tree into 12 clusters led to the best performance. For each cluster we extracted the first 5 PCs. Hence up to \(12 \times 5 \times 4=240\) additional features could be extracted if every cluster yielded 5 PCs. In the real data, however, only 206 PCs were generated, because very small clusters have fewer than 5 principal components in total. Those additional features were then combined with the original ROI measures from the previous steps to fit the XGBoost model for prediction. Fig. 3 provides one example, the correlation heatmap of the clustering structure for cortical volumes, which indicates a latent correlation structure among all the regions. The figure also shows that the first 5 principal components explain more than 60% of the variance for almost all clusters. After combining the 206 PCs with the original features generated in the previous steps, Pearson's correlations with the RFIS response and their p-values were calculated as in Subsect. 2.2. After the FDR p-value correction, 30 features were found to be significantly correlated with the RFIS, with adjusted p-values less than 0.05.
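The clustering and block PCA step can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: features are standardized, columns are clustered with SciPy's Ward linkage on Euclidean distances (which we take as the counterpart of Ward.D2), and PC scores are computed by SVD. The function name `block_pca` and its arguments are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def block_pca(X, n_clusters=12, n_pcs=5):
    """Cluster the columns of X and extract up to n_pcs PC scores per cluster.

    X is a (subjects x features) array for ONE structural metric,
    e.g. all cortical volumes. Returns (scores, labels).
    """
    # Standardize each feature (assumption: no constant columns).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Cluster features, so each column of Z is one observation for linkage.
    tree = linkage(Z.T, method="ward")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    blocks = []
    for c in np.unique(labels):
        block = Z[:, labels == c]
        # A cluster with fewer than n_pcs features yields fewer PCs.
        k = min(n_pcs, block.shape[1])
        U, s, _ = np.linalg.svd(block, full_matrices=False)
        blocks.append(U[:, :k] * s[:k])  # PC scores of this block
    return np.hstack(blocks), labels
```

Applying `block_pca` to each of the four metrics and stacking the outputs would give at most \(12 \times 5 \times 4\) columns, with fewer whenever a cluster contains fewer than 5 features. To score validation or test subjects, the loadings fitted on the training set would have to be stored and reused, which this sketch omits.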

2.4 XGBoost Statistical Models

All features generated in the preceding steps on the training set were combined as explanatory variables to fit a prediction model. First, the features were ranked by the p-values of their Pearson correlations with the RFIS, computed on the training dataset only; the \(p_0\) features with the smallest p-values were retained and fed into XGBoost. The “gblinear” booster was used with the default “reg:linear” objective function, and the initial prediction score (“base_score”) was set to zero. We set the learning rate to 0.05 and used 10-fold cross-validation on the training dataset, with the mean absolute error (MAE) as the evaluation metric, to select the number of boosting rounds. The \(L_2\) tuning parameter was fixed at the default value 1. We trained XGBoost models with different tuning parameters and then made predictions on the validation set. The optimal \(p_0\) was selected to minimize the mean squared error (MSE) on the validation set.

We compared the BlockPC-XGBoost with the XGBoost without block PCs, and the BlockPC-XGBoost using all features with the BlockPC-XGBoost using only the cortical volumes provided by the challenge organizers. The results for both training and validation are shown in the next section.

3 Results

The MSE and \(R^2\) for the BlockPC-XGBoost using all features, the BlockPC-XGBoost using mostly the gray matter cortical volumes provided by the challenge organizers, and the XGBoost without block PCs, for both training and validation, are shown in Table 1. The comparisons indicate the advantage of the proposed BlockPC-XGBoost. Importance scores of the top features for the BlockPC-XGBoost are shown in Fig. 4(b). We found that the important predictors did not exactly match the significant features in Fig. 2. The important predictors were: the gray matter volume of the right posterior cingulate gyrus and left caudate nucleus, the entorhinal white matter volume of the left hemisphere, the number of detected surface holes, the globus pallidus volume, several regional lateral ventricle and cerebellum volumes, and the mean curvatures of the precentral gyrus, postcentral gyrus and Banks of Superior Temporal Sulcus. Among the top 50 features, 15 (30%) were PCs obtained from clustering.

Table 1. Prediction results using 3 methods

4 Discussion

From the analysis above, we found several brain measures (white matter volume of the pons, gray matter of the right precentral gyrus, white matter volume of the left entorhinal region, etc.) that were significantly correlated with the RFIS. Starting from the features provided by the challenge organizers and the features we generated with FreeSurfer, we used a block PCA design to learn a representation of all these features, which showed a good learning ability for correlated features. We then used XGBoost to train a prediction model on the learned features, obtaining a result of 69.39 on the validation set. We also found several features that exhibit strong predictive power.

However, the proposed approach is based on segmentation and parcellation of the ROIs, and therefore relies on the precision of the image processing. Furthermore, the approach does not consider the spatial location of the ROIs. Compared with modern deep learning techniques, e.g., the U-Net or graphical model-based deep neural networks, it loses local information.