Supplementary Material

1 Systematic Search Methodology

We identified relevant articles by querying PubMed, Web of Science, Google Scholar and arXiv with the search terms listed in Fig. LABEL:search_terms. All fields in each database were queried, except for Google Scholar, where full texts were searched instead. For PubMed and Google Scholar, only articles from 2015 onwards were included; for Web of Science and arXiv, all years were included because of the small number of articles returned.
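The per-database search settings described above can be summarised in a small lookup structure. The following is a minimal sketch, not the tooling used for the review: the names `SearchConfig`, `SEARCH_CONFIG` and `in_search_window` are hypothetical, and only the scopes and year limits are taken from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SearchConfig:
    scope: str                # which part of each record was queried
    min_year: Optional[int]   # None means all years were included


# Values come from the paragraph above; the structure itself is illustrative.
SEARCH_CONFIG = {
    "PubMed": SearchConfig(scope="all fields", min_year=2015),
    "Web of Science": SearchConfig(scope="all fields", min_year=None),
    "Google Scholar": SearchConfig(scope="full text", min_year=2015),
    "arXiv": SearchConfig(scope="all fields", min_year=None),
}


def in_search_window(database: str, year: int) -> bool:
    """Return True if an article from `year` falls in the date range used for `database`."""
    min_year = SEARCH_CONFIG[database].min_year
    return min_year is None or year >= min_year
```

For example, `in_search_window("PubMed", 2014)` is false under these settings, whereas an arXiv article from any year falls inside the window.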

Next, we screened articles by title and abstract, accepting or rejecting each article against a set of inclusion and exclusion criteria (Table 1). Only the first 500 results from Google Scholar were screened because later results were largely irrelevant.

Inclusion/exclusion criteria for article screening
Include…both in-vivo and ex-vivo imaging.
Exclude…non-human subjects.
Include…the following imaging modalities: structural and functional MRI, CT, PET, DWI/tractography.
Exclude…EEG and MEG data.
Include…both peer reviewed and non-peer reviewed articles.
Exclude…non-English language articles.
Exclude…PhD and Masters theses.
Exclude…reviews, surveys, opinion articles and books. Articles must implement at least one interpretable deep learning method.
Exclude…interpretable methods applied to machine learning models other than neural networks (for example, decision trees, random forests, SVMs and Gaussian processes).
Exclude…articles that failed quality control (for example, articles whose methods were claimed to be interpretable but were not).

Table 1: Inclusion and exclusion criteria for title and abstract screening
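The criteria in Table 1 can be read as a sequence of checks applied to each candidate article. The sketch below is illustrative only and is not the procedure used by the reviewers, who screened articles manually. The `Candidate` fields and the function `rejection_reason` are hypothetical names. It assumes that an article is kept when it uses at least one included modality, so an article using only EEG or MEG is rejected. In-vivo versus ex-vivo imaging and peer-review status are not checked because both values are included.

```python
from dataclasses import dataclass
from typing import Optional, Set

INCLUDED_MODALITIES = {
    "structural MRI", "functional MRI", "CT", "PET", "DWI/tractography",
}
EXCLUDED_ARTICLE_TYPES = {
    "PhD thesis", "Masters thesis", "review", "survey", "opinion", "book",
}


@dataclass
class Candidate:
    # Hypothetical record filled in by a reviewer from the title and abstract.
    human_subjects: bool
    modalities: Set[str]
    language: str
    article_type: str
    model_is_neural_network: bool
    implements_interpretable_method: bool
    passes_quality_control: bool


def rejection_reason(c: Candidate) -> Optional[str]:
    """Return the first exclusion criterion that applies, or None if accepted."""
    if not c.human_subjects:
        return "non-human subjects"
    if not (c.modalities & INCLUDED_MODALITIES):
        return "no included imaging modality"
    if c.language != "English":
        return "non-English language"
    if c.article_type in EXCLUDED_ARTICLE_TYPES:
        return "excluded article type"
    if not c.implements_interpretable_method:
        return "no interpretable deep learning method implemented"
    if not c.model_is_neural_network:
        return "model is not a neural network"
    if not c.passes_quality_control:
        return "failed quality control"
    return None
```

Returning the reason rather than a bare boolean makes it easy to tally how many articles each criterion removed, should such a breakdown be wanted.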

After screening, we extracted the data relevant to our review questions from all accepted articles into a table. We extracted 27 data points covering six topics: article, imaging, modelling, interpretability method, interpretability method evaluation and study limitations (see Table LABEL:tab:article_data_collection in the appendix).

Fig. 1(a): Number of studies by year

The number of neuroimaging studies applying interpretable deep learning methods has approximately doubled annually (Fig. 1(a)); note that the cutoff date of this review was part way through 2021. Most studies used existing public medical image datasets (76%), the most popular being the Alzheimer’s Disease Neuroimaging Initiative (ADNI, 37% of studies), followed by the Human Connectome Project (HCP, 17% of studies) (Fig. 1(b)). The majority of studies (90%) used either structural or functional magnetic resonance imaging (MRI).
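As a rough formalisation of the annual doubling described above, let $N_y$ denote the number of studies published in year $y$ and $y_0$ a reference year. This notation is introduced here for illustration and does not appear in the original text. The relation is approximate and does not hold for 2021, which was only partly covered.

```latex
% Assumed notation: N_y = number of studies in year y, y_0 = reference year.
N_{y} \approx N_{y_0} \cdot 2^{\,y - y_0}
\quad\Longleftrightarrow\quad
\frac{N_{y+1}}{N_{y}} \approx 2
```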