

This is the author's version of an article published in IEEE Transactions on Medical Imaging. Changes were made to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/10.1109/TMI.2017.2780115

Copyright (c) 2017 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing pubs-permissions@ieee.org.

Macular OCT Classification Using a Multi-Scale Convolutional Neural Network Ensemble

Reza Rasti, Student Member, IEEE, Hossein Rabbani*, Senior Member, IEEE, Alireza Mehridehnavi, Member, IEEE, and Fedra Hajizadeh

Abstract—Computer-aided diagnosis (CAD) of retinal pathologies is a currently active area in medical image analysis. Due to the increasing use of the retinal optical coherence tomography (OCT) imaging technique, a CAD system in retinal OCT is essential to assist ophthalmologists in the early detection of ocular diseases and in treatment monitoring. This paper presents a novel CAD system based on a multi-scale convolutional mixture of experts (MCME) ensemble model to identify normal retina and two common types of macular pathology, namely, dry age-related macular degeneration (AMD) and diabetic macular edema (DME). The proposed MCME modular model is a data-driven neural structure which employs a new cost function for discriminative and fast learning of image features by applying convolutional neural networks (CNNs) to multiple-scale sub-images. MCME maximizes the likelihood function of the training dataset and ground truth by considering a mixture model which also tries to model the joint interaction between individual experts, using a correlated multivariate component for each expert module instead of modeling only the marginal distributions with independent Gaussian components. Two different macular OCT datasets from Heidelberg devices were considered for the evaluation of the method, i.e., a local dataset of OCT images of 148 subjects and a public dataset of 45 OCT acquisitions. For comparison purposes, we computed a wide range of classification measures for the best configurations of the MCME method. With the MCME model of 4 scale-dependent experts, a precision rate of 98.86% and an area under the ROC curve (AUC) of 0.9985 were obtained on average.

Index Terms—CAD System, Classification, Macular Pathology, Multi-scale Convolutional Mixture of Experts (MCME), Optical Coherence Tomography (OCT).

This work was supported in part by Isfahan University of Medical Sciences, Department of Biomedical Engineering, under Grant #395645.
R. Rasti, H. Rabbani, and A. Mehridehnavi are with the Department of Biomedical Engineering, School of Advanced Technologies in Medicine, Medical Image and Signal Processing Research Center, Isfahan University of Medical Sciences, Isfahan 8174673461, Iran. E-mail: mr.r.rasti@ieee.org, rabbani.h@ieee.org, and mehri@med.mui.ac.ir.
F. Hajizadeh is with the Noor Ophthalmology Research Center, Noor Eye Hospital, Tehran 1968653111, Iran. E-mail: fedra_hajizadeh@yahoo.com.

I. INTRODUCTION

THE retina in the human eye receives the light focused by the lens and converts it into neural signals. The main sensory region for this purpose is the macula, which is located in the central part of the retina. The macula processes light through special layers of photoreceptor nerve cells which are responsible for detecting light intensity, color, and fine visual details. The retina processes the information gathered by the macula and sends it to the brain via the optic nerve for visual recognition. In fact, the essential features of visual perception are traced to the retinal processing that encodes light into neural signals within the macular region.

Macular health can be affected by a number of pathologies, including age-related macular degeneration (AMD) and diabetic macular edema (DME). AMD is a retinal disease resulting in blurred vision, blind spots, or even no vision in the center of the visual field, and it was the fourth most common cause of blindness in 2013 [1]. According to [2], about 0.4% of people between 50 and 60 years old and approximately 15% of people over 60 suffer from AMD. Moreover, diabetic retinopathy accounts for a large share of working-age blindness: in the United States, 12% of all new cases of blindness each year (among people aged 20-64 years) are due to diabetic retinopathy [3]. Generally, DME is the most common diabetic cause of vision loss across societies [4]. In the early stages of retinopathy, central vision may be affected by DME. In diabetic patients, particularly those with type II diabetes, DME is the most frequent sight-threatening complication [5].

It has been shown that the blindness rate can be reduced by comprehensive screening programs and early treatment of the eyes with effective diagnostic tools [1]. In ophthalmology, one of the most commonly used imaging techniques is optical coherence tomography (OCT), with more than 5 million acquisitions in the US in 2014 [6]. OCT is a non-invasive imaging technique which captures cross-sectional images of biological tissues at microscopic resolution [7]. This evolving, high-speed diagnostic imaging technology makes a major contribution to the early identification and treatment of retinal pathologies today.

Since 3-D OCT image interpretation is a time-consuming and tedious process for ophthalmologists, different computer-aided diagnosis (CAD) systems for semi- or fully automatic analysis of OCT data have been developed in recent years [8]–[27]. To this end, various groups have developed computerized algorithms for pre-processing, including denoising and curvature correction [8]–[11], intra-retinal and pathological area segmentation [12]–[18], and 2-D or 3-D OCT classification [19]–[27]. Table I summarizes recent works on CAD systems in retinal OCT and highlights different aspects or strengths of the reviewed works.

In the present study, we propose a novel CAD system for automatic macular OCT classification. The proposed system consists of two main steps. First, in the preprocessing step, a graph-based curvature correction algorithm is used to remove retinal distortions and to yield a set of standard regions/volumes of interest (ROIs/VOIs). Second, in the classification step, a data-driven representation solution with a full-training approach is introduced and used.


TABLE I: Recent works on CAD systems in retinal OCT imaging.

Liu et al. [19]
  Database: Cirrus HD-OCT, 326 cases (4 classes: AMD, MH, ME, and Normal)
  Method: Retinal layer alignment + multi-scale LBP feature extraction + RBF-SVM classification
  Result: AUC = 0.93 based on 10-fold CV
  Notes: The analysis is limited to a single foveal slice which is manually selected by expert ophthalmologists.

Albarrak et al. [20]
  Database: 3-D OCT, 140 cases (2 classes: AMD and Normal)
  Method: B-scan denoising + LBP and HOG feature extraction + PCA + Bayesian classification
  Result: AUC = 0.944 based on 10-fold CV
  Notes: The presented method relies on a denoising step.

Farsiu et al. [21]
  Database: Bioptigen SD-OCT, 384 cases (2 classes: AMD and Control)
  Method: Semi-automatic segmentation of BM, ILM, and RPE layers + manual feature extraction + linear regression
  Result: AUC = 0.99 based on leave-one-out CV
  Notes: The algorithm relies on a precise segmentation of retinal layers and requires manual corrections to avoid misleading outcomes.

Srinivasan et al. [22]
  Database: Heidelberg SD-OCT, 45 cases (3 classes: AMD, DME, and Normal)
  Method: B-scan denoising + retinal layer alignment + multi-scale HOG feature extraction + linear SVM classification
  Result: CR of 86.67%, 100%, and 100% for Normal, AMD, and DME based on leave-three-out CV
  Notes: The method relies on a denoising step. Also, the case results depend on a threshold of 33% of the B-scans in a volume being abnormal.

Venhuizen et al. [23]
  Database: Bioptigen SD-OCT, 384 cases (2 classes: AMD and Control)
  Method: Unsupervised feature learning based on a Bag-of-Words (BoWs) method + random forests (RF) classification
  Result: AUC = 0.984
  Notes: The method does not need accurate layer segmentation.

Lemaitre et al. [24]
  Database: Cirrus SD-OCT, 32 cases (2 classes: DME and Normal)
  Method: NLM filtering + retinal layer flattening + LBP-TOP feature extraction + local mapping + BoWs features + RBF-SVM classification
  Result: Sensitivity = 81.2% and specificity = 93.7% based on leave-one-out CV
  Notes: The method relies on a denoising step. Moreover, no results were reported for any AMD data (the most relevant diagnostic problem in retinal OCT).

Apostolopoulos et al. [25]
  Database: Bioptigen SD-OCT, 384 cases (2 classes: AMD and Control)
  Method: B-scan mosaicking of 3-D OCT data + OCT classification by a 2-D deep CNN model (RetiNet C)
  Result: AUC = 0.997 based on five-fold CV
  Notes: The model is limited to 3-D OCT data with the same number of B-scans in different volumes.

Venhuizen et al. [26]
  Database: Heidelberg Spectralis HRA-OCT, 3265 eyes (5 classes: No AMD, Early AMD, Intermediate AMD, Advanced AMD GA, Advanced AMD CNV)
  Method: OCT volume resampling + detection of AMD-affected regions + BoWs feature learning + multi-class RF classification
  Result: AUC = 0.980 with a sensitivity of 98.2% at a specificity of 91.2%
  Notes: The method does not need accurate layer segmentation; to automatically identify AMD-affected regions, only a simple and coarse layer segmentation method is needed.

Sun et al. [27]
  Database: Heidelberg SD-OCT, Set1 [22] composed of 45 OCT volumes and Set2 composed of 678 retinal B-scans (3 classes: AMD, DME, and Normal)
  Method: B-scan denoising + retinal layer alignment + image partitioning + SIFT feature extraction + dictionary learning and sparse coding + multi-scale max pooling + linear SVM classification
  Result: CR of 93.33%, 100%, and 100% on Set1 and CR of 100%, 99.67%, and 99.67% on Set2 for Normal, AMD, and DME, respectively
  Notes: The method relies on a denoising step. Three 2-class SVMs are used in a max-out strategy and the results are reported based on leave-three-out CV. For volumetric diagnosis on Set1, the maximum-vote strategy is chosen.

ME = Macular Edema, MH = Macular Hole, CV = Cross-Validation, CR = Classification Rate, BM = Bruch's Membrane, ILM = Inner Limiting Membrane, RPE = Retinal Pigmented Epithelium, GA = Geographic Atrophy, CNV = Choroidal Neovascularization.

In this step, the system includes a new deep ensemble method based on convolutional neural networks (CNNs) [28]–[31]. The presented method uses the idea of a mixture of multi-scale CNN experts as a robust combining method. Generally, the goal of combining methods is to improve performance in prediction and classification tasks, particularly for complicated problems involving a limited number of patterns, high-dimensional feature sets, and highly overlapping classes [32]–[34].

The proposed ensemble model, the multi-scale convolutional mixture of experts (MCME), is a fast, scale-dependent CNN-based classifier which employs a prior decomposition and a new cost function for discriminative and fast learning of representative image features.

Compared to the works reviewed in Table I, the main contributions of the present research are: (i) the analysis of the proposed MCME model as a fully automatic classification approach with minimum pre-processing requirements, (ii) the capability of multi-slice analysis of volumetric OCTs, including different slicing and acquisition protocols, (iii) promising sensitivity, and (iv) robustness in the retinal OCT diagnostic problem.

Basically, the proposed CAD system is designed to analyze macular OCT volumes. For this purpose, the MCME model in the classification stage performs a slice-based analysis by assessing the retinal B-scans in the volumes and making a final diagnosis decision on subjects (cases). There are two main reasons for choosing this strategy: (i) slice-based analysis of retinal OCT volumes is the clinical routine in ophthalmology, and (ii) the different imaging protocols in retinal OCT (as we see in the datasets) do not yield consistent slicing and unique volume sizes for designing a full 3-D diagnostic system.

The outline of this paper is organized as follows. Section II describes the databases and the proposed MCME classification method used in this study. In Section III, the ability of the proposed method is investigated and results are presented. Section IV presents a comprehensive discussion of the method and the experimental studies, and finally, the paper is concluded in Section V.

II. MATERIAL AND METHODS

A. Database

For this research study, the proposed algorithm was designed and evaluated on two different datasets acquired by Heidelberg SD-OCT imaging systems. The first one was acquired at Noor Eye Hospital in Tehran and consists of 50 normal, 48 dry AMD, and 50 DME OCTs. For this dataset, the axial resolution is 3.5 µm with a scan dimension of 8.9 × 7.4 mm², but the lateral and azimuthal resolutions are not consistent across patients. Consequently, the number of A-scans varies between 512 and 768, and 19, 25, 31, or 61 B-scans per volume are


acquired from different patients. The dataset is available at http://www.biosigdata.com.

The second one is a publicly available dataset containing 45 OCT acquisitions obtained at Duke University, Harvard University, and the University of Michigan [22]. This dataset consists of volumetric scans, acquired with non-unique protocols, of the normal, dry AMD, and DME classes, with 15 subjects per class. Fig. 1 displays example B-scans from different SD-OCT volumes of each class.

Fig. 1: Example B-scans from Normal, AMD, and DME subjects in dataset 1 (top row) and dataset 2 (bottom row).

In addition to labeling at the patient level, all 4142 B-scans in dataset 1 and 3247 B-scans in dataset 2 were annotated by an ophthalmologist experienced in OCT imaging in order to train the proposed 2-D ensemble models. In total, for dataset 1, the labeled B-scans consist of 862 DME and 969 AMD B-scans. For dataset 2, the labeled samples are 856 DME and 711 AMD B-scans; the remaining B-scans are considered normal.

B. Data Preprocessing and VOI Extraction

A general overview of the preprocessing algorithm is shown in Fig. 2. The algorithm consists of the following steps.

Fig. 2: General pipeline of the preprocessing algorithm.

1) Normalization: To obtain a unique field of view for the OCT images, all the B-scans in the different volumes are resized to 496 × 512 pixels. In addition, in order to handle the intensity variations in OCT images from different patients, a normalization step is applied to remove the intensity mean value of each B-scan (zero mean) and to scale it to a standard deviation of one.
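For illustration, this step can be sketched as follows in NumPy/scikit-image (an illustrative sketch; the function name and the small epsilon guard are not from the original implementation):

```python
import numpy as np
from skimage.transform import resize

def normalize_bscan(bscan):
    """Resize a B-scan to the common 496 x 512 field of view and
    standardize its intensities to zero mean and unit standard deviation."""
    fixed = resize(bscan.astype(np.float64), (496, 512), anti_aliasing=True)
    return (fixed - fixed.mean()) / (fixed.std() + 1e-8)  # epsilon avoids division by zero
```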
2) Retinal flattening algorithm and image cropping: In OCT images, due to anatomical structures and acquisition distortions, the retinal layers in B-scans may be shifted or randomly oriented. Consequently, retinal locations vary considerably across B-scans. In order to deal with this issue, a graph-based geometry curvature correction algorithm [11] is used for retinal flattening, based on the detection of the hyper-reflective complex (HRC) band in the retina, without any denoising step (see Fig. 1). The main idea behind this algorithm is the construction of graph nodes from each pixel of the OCT image. Each node is defined in a three-dimensional space determined by the normalized intensity, the horizontal position, and the vertical position of each pixel. The graph edges are then created according to measures of brightness similarity and closeness, after a pipeline of morphological operations and candidate region sorting. These processes are designed to reduce the graph size and to restrict the graph point selection to an area around the HRC. After generating a connectivity matrix, a random walk method is employed to jump from one node to another. Finally, based on fitting a second-order polynomial to the detected HRC, the retinal boundary points are shifted vertically so that the HRC points lie on a horizontal band.

Using this algorithm, all the B-scans are flattened to overcome the misalignment of retinal layers. To this aim, in each flattened image, the estimated HRC is warped to a horizontal line located at 70% of the image height. Also, in order to focus on the region of the retina that contains the main morphological structures and to reduce the dimension of the image, each B-scan is first cropped vertically, keeping 200 pixels above and 35 pixels below the estimated HRC. These cropping values were selected via visual inspection of both datasets to preserve all the retinal B-scan information. Finally, each cropped SD-OCT image is resized to 128 × 512 pixels for further processing. Fig. 3 demonstrates examples of the retinal flattening and cropping steps.

Fig. 3: Output samples of the image cropping block. The flattened and cropped B-scans belong to dataset 1. The image size of the flattened outputs is 128 × 512 pixels.

3) ROI selection, VOI generation, and augmentation strategy: In this step, the ROI is selected by cropping a centered 128 × 470-pixel bounding box from each B-scan in a given case. Next, all the extracted ROIs of all B-scans are resized to 128 × 256 pixels and concatenated to generate the case VOI. In the learning phase, in order to have an efficient training process, the centered bounding box of each training ROI is horizontally flipped and/or translated by ±20 pixels to generate an augmented training collection of ROIs, as sketched below. This strategy increases the number of samples by a factor of six in our training dataset, reduces the chance of over-fitting, and also reduces the inconsistency in the data due to different numbers of right and left eyes [25].
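The exact flip/translation combination behind the factor of six is not spelled out in the text; one plausible reading is sketched below ({original, flipped} × three horizontal shifts), with np.roll standing in for a proper pad-and-crop translation:

```python
import numpy as np

def augment_roi(roi, shift=20):
    """Return six training variants of one ROI: the original and its horizontal
    flip, each at three horizontal translations (-shift, 0, +shift pixels)."""
    variants = []
    for image in (roi, np.fliplr(roi)):       # horizontal flip
        for dx in (-shift, 0, shift):         # +/-20-pixel translation
            # np.roll wraps pixels around the border; zero-padding after a
            # crop would be an equally plausible implementation choice.
            variants.append(np.roll(image, dx, axis=1))
    return variants
```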
4) Multi-scale spatial decomposition: According to the following motivations, the multi-scale spatial pyramid (MSSP)


decomposition [35] is applied to macular OCT B-scans before feeding them to the convolutional ensemble models: (i) some retinal pathologies such as DME exhibit key characteristics at different scales, so multi-scale views of the retina should be presented to the model [19]; and (ii) although CNN-based algorithms benefit from spatial pooling, which provides some built-in invariance to distorted, scaled, and translated inputs [36], in the proposed ensemble model the prior MSSP decomposition can additionally be used to reduce the time complexity and the number of effective parameters of the overall model, lowering the chance of over-fitting and yielding a promising performance in practice. Therefore, to assess the main hypothesis of the study, four levels of a multi-scale Gaussian low-pass image pyramid (i.e., l0 to l3) are considered to simulate the multi-view perception of the proposed model. To do this, the image pyramids are calculated for each slice within the VOIs using a symmetric pyramidal decomposition method.

Suppose that the B-scan I is represented by a 2-D array with C columns and R rows of pixels. This image is the zero level (l0) of the Gaussian pyramid. Pyramid level l contains image I_l, which is a reduced, low-pass filtered version of I_{l-1}. Each value within level I_l is computed as a weighted average of values in level I_{l-1} within a 5 × 5 window [35]. Therefore, the different pyramid levels are calculated by:

    I_l(i, j) = \sum_{m=-2}^{2} \sum_{n=-2}^{2} W(m, n) \, I_{l-1}(2i + m, 2j + n),    (1)

Here l refers to the level of the pyramid, and the positive integers i and j are respectively less than C_l and R_l, where C_l and R_l are the dimensions of the l-th scale. In this study, the separable kernel W(m, n) = w(m) \cdot w(n) with w = [(1/4) − (a/2), 1/4, a, 1/4, (1/4) − (a/2)] and a = 0.375 is used.
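A compact NumPy sketch of this REDUCE step follows; Eq. (1) does not specify the border handling, so the edge-replication mode used here is an assumption:

```python
import numpy as np
from scipy.ndimage import convolve

a = 0.375
w = np.array([0.25 - a / 2, 0.25, a, 0.25, 0.25 - a / 2])
W = np.outer(w, w)  # separable 5x5 kernel, W(m, n) = w(m) * w(n)

def pyramid_reduce(image):
    """One step of Eq. (1): 5x5 weighted averaging followed by 2x downsampling."""
    smoothed = convolve(image.astype(np.float64), W, mode='nearest')
    return smoothed[::2, ::2]

def mssp(roi, levels=4):
    """Return the pyramid levels l0..l3 for one ROI (e.g., 128x256 -> 16x32)."""
    pyramid = [roi]
    for _ in range(levels - 1):
        pyramid.append(pyramid_reduce(pyramid[-1]))
    return pyramid
```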
C. Classification Methodology

In the present work, a new modular ensemble method based on a mixture of CNN experts is used as a robust image-based classifier. This combined model originates from the divide-and-conquer concept in the machine learning literature. It benefits from a multi-view decomposition of the input patterns to fuse key information and to solve the classification problem in a sparse and efficient manner. In the following sub-sections, the MCME model is presented in detail.

1) Mixture of experts (ME) background: The traditional ME structure was introduced by Jordan and Jacobs in 1991 [37], and was later extended by several research groups for combining the outputs of multi-layered perceptrons (MLPs) [38]–[40]. As illustrated in Fig. 4, this model combines the outputs of several expert (classifier) networks by training a gating network. Given an input pattern, each expert network estimates the conditional posterior distribution on the partitioned feature space which is separated by the gating network. In fact, the gating network plays a weighting role and causes the overall model to execute a competitive learning process for the expert modules [37]. To do that, this module tries to maximize the likelihood function of the training dataset and ground truth by considering a Gaussian mixture model (GMM), in which each Gaussian component in the mixture model corresponds to one expert module [41], [42].

Fig. 4: The conventional mixture of L experts (classifiers) structure: a common signal supplies the input of all modules, i.e., the experts and the gating network.

2) The MCME model: In the proposed MCME method for retinal classification, the experts and gating network are constructed from CNNs. The scale-dependent modules used in MCME enable the model to have a multi-scale perception of input patterns inspired by visual attention systems. So, with a symmetric Gaussian decomposition of the input ROIs, specific pyramidal scales (views) are fed to regular CNN experts and a convolutional gating network (CGN). In contrast to the traditional ME, the suggested prior decomposition of the inputs can be useful for dividing the task among simpler experts in MCME, so that the complexity of the overall model is reduced. This is made possible by the selective information fusion of the different scales with the gating network. The proposed MCME model is depicted in Fig. 5. It is noted that all modules in this structure are learned simultaneously, rather than independently or sequentially, based on an end-to-end optimization procedure. This strategy provides an ability for the modules to interact with each other and to specialize in different parts of the automatically extracted feature space.

Fig. 5: The MCME structure: the CNN experts and gating network are fed by specific scales of the input pattern.

2.1. MCME signal forward propagation: The MCME model learns a set of scale-dependent convolutional expert networks f_i with scale l_{i-1}: i = 1, ..., L (in which each expert has its own specific input scale) along with a CGN g with the original input scale (l0). Each f_i maps the scaled input x_i to C outputs (one output for each class: AMD, DME, and Normal),


while g_i(x_0) is a conditional posterior distribution over the CNN experts. The CGN has a Softmax output layer. It assigns a probabilistic weight g_i to each expert's output f_i. The g_i fusion weights are optimized in the learning process, where the CGN is simultaneously and interactively trained with the individual experts to adaptively predict the contribution of each scale-dependent CNN expert by encoding the l0-scale input into a vector of g_i values. The final output of the entire MCME structure for the k-th input image is then given by Eq. 2:

    F_{MCME}(x^k) = \sum_{i=1}^{L} g_i(x_0^k) \, f_i(x_i^k)
                  = \sum_{i=1}^{L} P(e_i \mid x_0^k) \, P(c \mid e_i, x_i^k)
                  = P(c \mid x^k),    (2)

where L is the number of CNN experts (different scales) in the model, e_i indicates the i-th expert module, and c is the output class label.
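In code, Eq. (2) amounts to the following few lines, where the expert and gating callables stand in for trained CNNs (the interfaces and shapes are illustrative assumptions):

```python
import numpy as np

def mcme_forward(experts, gating, pyramid):
    """Eq. (2): mix the expert class posteriors with the gating weights.

    experts: list of L callables; experts[i] maps the scale-l_i image x_i to a
             length-C vector f_i of per-class scores (sigmoid outputs).
    gating:  callable mapping the l0-scale image to L softmax weights g_i.
    pyramid: list [x_0, x_1, ..., x_{L-1}] of scaled views of one ROI.
    """
    g = gating(pyramid[0])                                      # g_i(x_0), sums to 1
    f = np.stack([fi(xi) for fi, xi in zip(experts, pyramid)])  # (L, C) expert outputs
    return g @ f                                                # F_MCME = sum_i g_i * f_i
```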
2.2. MCME error back propagation: The traditional mixture of experts method can train de-correlated individual expert modules in a suitable fashion and benefits from this [43], but the ME error back-propagation procedure does not include any control parameter to keep the experts uncorrelated. Inspired by [44], to control and monitor this ability of the ME method in the training algorithm, we added a cross-correlation penalty term to the MCME error cost function. Therefore, the total error cost function of MCME for the k-th input pattern is defined as:

    E_{MCME}(x^k) = -\ln\left( \sum_{i=1}^{L} g_i(x_0^k) \, e^{-\frac{1}{2}\lVert d^k - f_i(x_i^k) \rVert^2 + \lambda \rho_i^k} \right),    (3)

where f_i, d, and ρ_i are the output vector of CNN expert i, the desired output for the input sample x^k, and the cross-correlation penalty term, respectively. Here, ρ_i is defined as follows:

    \rho_i^k = \frac{1}{L-1} \sum_{j=1,\, j \neq i}^{L} \left( f_i(x_i^k) - F_{MCME}(x^k) \right) \cdot \left( f_j(x_j^k) - F_{MCME}(x^k) \right)^{T},    (4)

The strength of this penalty is adjusted explicitly with the control parameter 0 ≤ λ ≤ 1. Obviously, for λ = 0 there is no cross-correlation penalty term in the MCME cost function.
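The per-pattern cost of Eqs. (3) and (4) can be written directly as below; this is a sketch for a single input, and in practice the gradients of this expression with respect to all module parameters would be obtained by automatic differentiation:

```python
import numpy as np

def mcme_cost(g, f, d, lam):
    """Eqs. (3)-(4) for one pattern.

    g: (L,) gating weights, f: (L, C) expert outputs, d: (C,) desired output,
    lam: cross-correlation penalty strength, 0 <= lam <= 1.
    """
    L = f.shape[0]
    F = g @ f                          # overall MCME output, Eq. (2)
    dev = f - F                        # each expert's deviation from the ensemble
    # rho_i (Eq. 4): mean cross-correlation of expert i's deviation with the others
    rho = np.array([dev[i] @ (dev.sum(axis=0) - dev[i]) for i in range(L)]) / (L - 1)
    sq_err = np.sum((d - f) ** 2, axis=1)          # ||d - f_i||^2 per expert
    return -np.log(np.sum(g * np.exp(-0.5 * sq_err + lam * rho)))
```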
From the technical point of view, the purpose of adding the ρ penalty term is to negatively correlate each CNN expert's error with the other CNN experts' errors in the MCME ensemble. As explained before, the ME model tries to maximize the likelihood function of the training dataset and ground truth by considering a GMM in which each expert module is modeled by an independent multivariate Gaussian. In MCME, however, correlated multivariate components are employed, which also model the joint interaction between the individual experts instead of only modeling the marginal distributions.

By calculating the error gradient for all free parameters of the experts and gating networks, different optimization methods can be used for learning the MCME structure. In this study, the mini-batch root mean square propagation (RMSprop) procedure [45] is used for parameter updating. In the present research, in order to classify the input VOIs as AMD, DME, or Normal cases, we hypothesize that an adequate number of scale-dependent CNN experts and a CGN (with l0 input scale) have the potential to execute a competitive feature representation process and to yield an efficient combination of key intensity, shape, texture, and context features while preserving speed considerations. So, with extended scale fusion, if the MCME modules split the learned feature space properly, the overall classification rate will increase.

III. EXPERIMENTAL STUDY AND RESULTS

A. Performance Measures

Classification performance in this problem was computed based on the following evaluation measures. According to the 3-class confusion matrix and receiver operating characteristic (ROC) analyses, the values of precision, recall, F1-score, and average area under the ROC curve (AUC) are used for the performance evaluation of all implemented structures at the patient level.

For the evaluated ensemble models, in order to explore the ability of the experts to partition the feature space of the problem, the average correlation coefficient, Cohen's kappa (κ), and distance-based disagreement (DbD) factors are considered. The direction and strength of a linear relationship between the CNN experts in the model can be indicated by the correlation coefficient. Indeed, κ is a statistical measure of classifier agreement that expresses the level of agreement (−1 ≤ κ ≤ 1) between each pair of different estimators on a classification problem [46]. A value of 1 means complete agreement (the lower the κ, the less the agreement). Moreover, the DbD factor represents the experts' disagreement, where the confusion matrices are used to compute distances for each individual expert [47]. For the MCME model with experts 1, 2, ..., L, a distance measure D^l between expert l and all other experts is calculated by Eq. 5, in which cm^l_{i,j} are the confusion matrix elements for expert l:

    D^l = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{k=1,\, k \neq l}^{L} \lvert cm_{i,j}^{l} - cm_{i,j}^{k} \rvert, \quad l = 1, 2, ..., L.    (5)

In the above formula, all confusion matrices are considered with N classes. Therefore, for the MCME model with L experts, the DbD measure (DbD ≥ 0) can be expressed by the following definition:

    DbD = \frac{1}{L} \sum_{l=1}^{L} D^l.    (6)

In the DbD formula, the confusion matrix information of the individual experts is used to express the average experts' disagreement. A higher DbD value indicates more disagreement in the class predictions of the experts.


TABLE II: Details of the single-scale convolutional expert network structures.

Module | Input scale | Input size | Number of layers | First conv. mask size | Other conv. mask size | Max-pooling size | FMs in C and P layers | FC1 neurons | FC2 (output) neurons | Number of parameters
CNN1 | l0 | 128 × 256 | 19 | 5 × 5 | 3 × 3 | 2 × 2 | 3 | 15 | 3 | 2993
CNN2 | l1 | 64 × 128 | 16 | 5 × 5 | 3 × 3 | 2 × 2 | 3 | 15 | 3 | 1901
CNN3 | l2 | 32 × 64 | 13 | 5 × 5 | 3 × 3 | 2 × 2 | 3 | 15 | 3 | 1381
CNN4 | l3 | 16 × 32 | 10 | 5 × 5 | 3 × 3 | 2 × 2 | 3 | 15 | 3 | 997

Note: FM is the number of feature maps. C, P, and FC indicate the convolutional, pooling, and fully-connected layers, respectively.
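To make Table II concrete, the sketch below builds a CNN1-style expert in the modern Keras API (the original implementation used Keras v1.2 on Theano). Reading each CONV-BN-POOL block as three of the "active layers" and the two FC-BN stacks as the remaining four is an interpretation of the layer counts, so the block depths here are an assumption rather than the published architecture:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_expert(input_shape=(128, 256, 1), n_blocks=5):
    """CNN1-style expert per Table II: CONV-BN-POOL blocks with 3 feature maps
    (5x5 first mask, 3x3 afterwards, 2x2 pooling), then Dropout(0.7)-FC(15)-BN
    and a 3-way sigmoid output. Glorot-uniform is the Keras default initializer,
    matching the paper's stated choice."""
    x = inputs = keras.Input(shape=input_shape)
    for b in range(n_blocks):
        x = layers.Conv2D(3, 5 if b == 0 else 3, padding='same', activation='relu')(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    x = layers.Dropout(0.7)(x)                          # FC1 dropout factor of 70%
    x = layers.Dense(15, activation='relu')(x)          # FC1
    x = layers.BatchNormalization()(x)
    outputs = layers.Dense(3, activation='sigmoid')(x)  # FC2: AMD / DME / Normal
    return keras.Model(inputs, outputs)
```

Under the same reading, n_blocks = 4, 3, and 2 with the smaller input sizes of Table II would give the 16-, 13-, and 10-layer variants (CNN2-CNN4).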

B. Validation and Diagnostic Strategy

In this study, to evaluate and generalize the performance of the model to an independent dataset, the unbiased 5-fold cross-validation method is applied at the patient level (OCTs). For this purpose, the SD-OCT volumes of each dataset are first stratified and partitioned into 5 equally sized folds to ensure that each fold is a good representative of the whole. Subsequently, 5 iterations of training and validation are performed such that within each iteration a different fold of the data is held out for validation while the remaining 4 folds are used for the learning process.

Furthermore, to perform a balanced learning process, an approximately comparable number of normal B-scans and AMD or DME sets are randomly selected from the volumes in each dataset. Also, any dataset-dependent bias is removed by preserving the random seed across all iterations.

As mentioned before, the learning process is performed on the selected cases in the training folds, applying the augmentation method to their B-scans. Finally, the diagnosis decision of a trained model for a test VOI is obtained by the following rule: in a given 3-D volume, if more than 15 percent of the B-scans (τ = 15%) are predicted as abnormal, the maximum probability of the B-scans' votes (according to the AMD or DME likelihood scores) determines the type of the patient's retinal disease. This threshold value is further evaluated in Sub-section III-E.
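This rule can be sketched as below, assuming per-B-scan scores in the class order [AMD, DME, Normal]; the class ordering and the reading of "maximum probability of the votes" are assumptions:

```python
import numpy as np

def diagnose_volume(slice_probs, tau=0.15):
    """Volume-level rule: slice_probs is an (n_bscans, 3) array of per-slice
    [AMD, DME, Normal] scores. If more than a fraction tau of the B-scans are
    predicted abnormal, return the abnormal class (0=AMD, 1=DME) with the
    highest single-slice likelihood; otherwise call the volume Normal (2)."""
    predictions = slice_probs.argmax(axis=1)
    abnormal = predictions != 2
    if abnormal.mean() > tau:
        return int(slice_probs[abnormal, :2].max(axis=0).argmax())
    return 2
```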
C. Study Design

1) Baseline studies: In this study, to illustrate the proficiency of the proposed strategy as well as to get a criterion for the comparison of MCME's performance and complexity in retinal OCT classification, the following baselines are evaluated:

• Feature-based HOG-LSVM classification method [22].
• Convolutional non-ensemble methods:
  – Off-the-shelf competitive CNNs: The VGG19 [48], ResNet50 [49], and InceptionV3 [50] models are evaluated for comparison purposes. To do this, for each considered deep network, after modification of the l0 scale according to the input receptive field of the model, the CNN codes are extracted as the visual features. The extracted features are then classified by a Softmax classifier including three output neurons and an optimized dropout factor of 30% for its visible layer. Here, the input modification includes: (i) resizing l0 according to the network input size (e.g., 224 × 224 pixels for the VGG19 model), and (ii) duplicating the gray channel 3 times to construct the RGB input (e.g., 224 × 224 × 3 input dimensions for the VGG19 model).
  – Single-scale CNNs: 4 different structures are considered according to Table II to assess single-scale CNN classification performance. All of the CNNs are constructed from sequences of CONV-BN(1)-POOL combinations in the hidden layers and end with two stacks of fully-connected (FC)-BN layers, including 15 and 3 output neurons, respectively. In order to reduce the probability of over-fitting during the learning process, an optimized dropout factor of 70% is also used for all FC1 layers.
• Convolutional ensemble methods:
  – Ave-ensemble model: A combination method commonly used in the convolutional ensemble literature is the averaging of different CNN subnetworks at the output layers. Generally, this method trains several CNNs from the available ground-truth labels. For a given input image, the real outputs of the trained CNNs are averaged to generate the final output of the ensemble [52], [53]. To implement this method, we used 4 different scales (i.e., l0 to l3) and the CNNs in Table II. This ensemble method is called the Ave-ensemble model in the rest of this paper.
  – Full-rank convolutional mixture of experts (FCME) model: To get better insight into the proposed prior decomposition in the convolutional ME model [54], and also the proposed cost function, the FCME model is considered and analyzed (see Fig. 4). For this purpose, the FCME structure is investigated using the full-rank combination of 2, 3, and 4 similar experts with l0 input scale (i.e., CNN1 in Table II).

(1) Batch normalization layer [51].

2) Characterization of the MCME model: This experiment is designed to assess the potential of the proposed MCME model with different scales. To this end, the influence of the number of experts (scales) on the performance of the convolutional mixture structure is investigated versus simple experts. For this purpose, to get better insight into the MCME performance in the classification of retinal pathologies in the datasets, a low- to high-resolution strategy is performed. Therefore, the single-scale CNNs are considered for the expert modules (according to Table II). Moreover, the CGN module is designed to have a similar topology to CNN1, but its


TABLE III: Details and average performance of the baselines and the proposed structures on dataset 1, according to 5-fold cross-validation, a decision threshold of 15%, and optimum λ values for the ME models.

Method | Configuration | Best λ | Precision (%) | Recall (%) | F1 (%) | AUC | MSE | Correlation | DbD | Kappa | Tr. time (s/ROI)
Feature-based | HOG+LSVM [22] | − | 85.35±9.51 | 82.56±11.2 | 82.09±11.1 | 0.903 | 0.311 | − | − | − | −
Off-the-shelf CNN models | VGG19 [48] | − | 92.65±4.00 | 91.39±5.69 | 91.27±5.65 | 0.935 | 0.224 | − | − | − | −
Off-the-shelf CNN models | ResNet50 [49] | − | 95.31±3.40 | 94.67±4.00 | 94.62±3.99 | 0.960 | 0.115 | − | − | − | −
Off-the-shelf CNN models | InceptionV3 [50] | − | 93.32±2.97 | 92.06±3.97 | 91.80±4.24 | 0.941 | 0.263 | − | − | − | −
Single-scale CNNs | CNN1 (l0) | − | 97.50±1.31 | 97.33±1.33 | 97.28±1.38 | 0.991 | 0.027 | − | − | − | 0.086
Single-scale CNNs | CNN2 (l1) | − | 96.95±2.11 | 96.63±2.11 | 96.55±2.35 | 0.995 | 0.034 | − | − | − | 0.041
Single-scale CNNs | CNN3 (l2) | − | 95.58±6.21 | 93.32±10.1 | 92.06±12.7 | 0.986 | 0.068 | − | − | − | 0.026
Single-scale CNNs | CNN4 (l3) | − | 78.80±17.3 | 77.17±7.42 | 73.55±12.4 | 0.963 | 0.351 | − | − | − | 0.015
Ave-ensemble model [52], [53] | l3−l2−l1−l0 | − | 96.45±0.98 | 95.96±1.33 | 95.95±1.33 | 0.994 | 0.040 | 0.31 | 0.67 | 0.77 | 0.169
FCME model [54] | l0−l0 | 0.1, 0.2 | 98.24±1.48 | 97.97±1.63 | 98.01±1.64 | 0.994 | 0.020 | −0.03 | 0.34 | 0.57 | 0.150
FCME model [54] | l0−l0−l0 | 0.1 | 98.83±1.48 | 98.64±1.63 | 98.67±1.64 | 0.992 | 0.013 | 0.11 | 0.58 | −0.02 | 0.193
FCME model [54] | l0−l0−l0−l0 | 0.3 | 97.05±2.76 | 96.62±2.98 | 96.57±3.13 | 0.993 | 0.034 | 0.02 | 0.77 | 0.07 | 0.265
MCME model | l1−l0 | 0.4 | 98.83±1.48 | 98.66±1.63 | 98.68±1.64 | 0.995 | 0.013 | 0.04 | 0.51 | 0.41 | 0.120
MCME model | l2−l0 | 0.2, 0.5, 0.7 | 98.24±1.48 | 97.97±1.63 | 98.01±1.64 | 0.994 | 0.020 | −0.05 | 0.70 | 0.26 | 0.109
MCME model | l2−l1 | 0.4 | 97.93±3.01 | 97.29±3.89 | 97.39±3.81 | 0.994 | 0.027 | 0.03 | 0.52 | 0.39 | 0.088
MCME model | l3−l0 | 0.3 | 98.34±2.25 | 97.99±2.67 | 98.01±2.69 | 0.992 | 0.020 | 0.07 | 0.56 | 0.34 | 0.115
MCME model | l3−l1 | 0.3 | 98.21±1.48 | 97.97±1.63 | 97.99±1.64 | 0.995 | 0.040 | −0.01 | 0.62 | 0.29 | 0.107
MCME model | l3−l2 | 0.2 | 98.34±2.25 | 97.96±2.67 | 97.99±2.69 | 0.994 | 0.020 | −0.04 | 0.98 | 0.14 | 0.085
MCME model | l2−l1−l0 | 0.1 | 98.83±1.48 | 98.64±1.63 | 98.67±1.64 | 0.997 | 0.013 | 0.03 | 0.53 | 0.13 | 0.143
MCME model | l3−l1−l0 | 0.1 | 98.83±1.48 | 98.64±1.63 | 98.67±1.64 | 0.995 | 0.013 | −0.01 | 0.47 | 0.30 | 0.139
MCME model | l3−l2−l0 | 0.4 | 98.24±1.48 | 97.97±1.63 | 98.01±1.64 | 0.992 | 0.020 | 0.07 | 0.51 | 0.09 | 0.127
MCME model | l3−l2−l1 | 0.6 | 97.76±2.11 | 97.30±2.49 | 97.38±2.45 | 0.993 | 0.027 | −0.06 | 0.51 | 0.30 | 0.101
MCME model | l3−l2−l1−l0 | 0.2 | 99.39±1.21 | 99.36±1.33 | 99.34±1.34 | 0.998 | 0.006 | −0.04 | 0.79 | 0.03 | 0.170

Note: l_i indicates the MSSP decomposition level of the input ROI for the scale-dependent CNNs in the models.

output layer contains 2, 3, or 4 Softmax neurons, according to the 2-, 3-, or 4-scale MCMEs, respectively.

Consequently, the MCME structure is investigated by testing every combination of 2, 3, or 4 experts (scales). Parameter optimization is carried out with a grid search on a nested 5-fold cross-validation within each fold's training set, using the precision metric. The considered structures are:

• Two-scale MCME: 6 different structures, i.e., l3−l2, l3−l1, l3−l0, l2−l1, l2−l0, and l1−l0.
• Three-scale MCME: 4 different structures, i.e., l3−l2−l1, l3−l2−l0, l3−l1−l0, and l2−l1−l0.
• Four-scale MCME: the complete combination of 4 experts (l3−l2−l1−l0) is evaluated for comparison purposes.

For the convolutional structures, the training process is executed with the mini-batch RMSprop algorithm [45] (training parameters: lr = 0.001, rho = 0.9, batch size = 32, max epochs = 50, and decay = 10^−5). In the off-the-shelf CNN baseline study, the cross-entropy cost function is used for training the Softmax classifiers. Furthermore, the proposed cost function is optimized by varying the control parameter λ between 0 and 1 with a step of 0.1 for the mixture ensemble models. In all experts, the Sigmoid activation function is used for the output layers (i.e., the FC2 layers), while the corresponding function is Softmax for the CGN. Moreover, the ReLU function is selected for all hidden layers, and the modules are initialized by the Glorot-uniform method [55].

D. Performance Results

1) Dataset 1: Table III shows the details and average testing results of the evaluated structures according to the 5-fold cross-validation method on dataset 1 and the diagnostic threshold (τ) of 15%. Following the reported performances, the MCME model with the fusion of l3−l2−l1−l0 scales at λ = 0.2 outperforms the other methods and configurations. It presents an AUC of 0.998 with a precision of 99.39% at a recall of 99.36% on this dataset.

The precision results of the best MCME models and the FCME baselines are compared as a function of the λ factor in Fig. 6.

Fig. 6: Comparison of the precision rate of the best ME structures with different numbers of experts/scales on dataset 1 at τ = 15%.


2) Dataset 2: In this experiment, the best multi-scale structures from dataset 1, i.e., the l1−l0, l2−l1−l0, and l3−l2−l1−l0 MCME models, are evaluated on dataset 2 by exploring the optimal λ parameter at τ = 15%. The precision measures are calculated to be 95.67%, 98.33%, and 96.67% for these models, respectively. So, the l2−l1−l0 MCME model at λ = 0.7 outperforms the other models with an AUC of 0.999, a recall of 97.78%, and an F1-score of 97.71%, while the MSE, mean correlation coefficient, DbD, and κ values of this model on this dataset are 0.02, −0.05, 0.51, and 0.16, respectively.

E. Diagnostic Sensitivity

In this experiment, different values of the diagnostic threshold τ are explored to investigate the effect of this parameter on the performance of the selected MCME models at the patient level. Fig. 7 demonstrates the CAD system recall changes versus τ on dataset 1 and dataset 2.

Fig. 7: Recall vs. τ curves of the selected MCME models on the datasets evaluated by the 5-fold cross-validation method.

The evaluated models were coded in Python 2.7 (based on the Theano v0.8 [56] and Keras v1.2 [57] toolkits) and trained using an NVIDIA Pascal GTX 1080 GPU card with the NVIDIA CUDA v8.0 and cuDNN v5.1 accelerated libraries. It should be noted that at test time, the average classification time was about 8 msec/ROI for the l3−l2−l1−l0 MCME model. This is approximately 1.4 times faster than the best FCME model with three l0-dependent CNN experts.

IV. DISCUSSION

In the present study, we proposed a novel and automatic CAD system for retinal OCT images using a new deep ensemble model (i.e., MCME) in the classification stage. To assess the CAD system, the use of different scales (experts) in the MCME model and the effects on diagnostic performance and time complexity were explored on two different datasets. The first one was a local dataset of 148 subjects. This dataset was used to investigate the best configurations of the MCME method. The second dataset was utilized to evaluate the proficiency of the selected MCME configuration. This set is a publicly available dataset of 45 volumetric SD-OCT acquisitions [22]. To this end, the evaluation of the proposed ensemble method was done by ROC and confusion matrix analyses.

A. MCME Analysis based on Dataset 1

For dataset 1, considering the 5-fold cross-validation and the diagnostic threshold τ = 15%, an exhaustive experimental study was carried out over all the possible scale-fusion configurations of the proposed MCME model. The results are summarized in Table III. As the best result, an overall precision of 99.39%, recall of 99.36%, F1-score of 99.34%, and AUC of 0.998 were achieved for the classification of the patients with an MCME model including 4 experts with l3−l2−l1−l0 input scales and a gating network with l0 input. Regarding the evaluation of the feature-space partitioning hypothesis by the MCME components, the mean correlation coefficient, DbD, and Cohen's kappa (κ) factors were assessed and were obtained to be −0.04, 0.79, and 0.03, respectively, for the built-in CNN experts of the model.

For the benchmark study, to get better insight into the performance of the MCME model, five different techniques were considered as baselines: (1) HOG-LSVM, (2) off-the-shelf CNNs, (3) single-scale CNNs, (4) the convolutional Ave-ensemble model, and (5) the full-rank convolutional mixture of experts model.

For this purpose, in the first step, the HOG-LSVM method [22] was tried as a classic feature-based technique, yielding a precision of 85.35% and an AUC of 0.903.

Moreover, the recent competitive deep networks VGG19, ResNet50, and InceptionV3 were evaluated separately on this problem as off-the-shelf visual feature extractors, with the extracted features classified by the Softmax classifier. As reported in Table III, the best results belonged to the ResNet50 model with a precision of 95.31%. Indeed, that model's execution time in the feature extraction step was about 33.2 sec/image on average. It should be noted that, as pointed out in [58], applications of off-the-shelf deep CNNs to computer-aided diagnostic or detection systems can be improved either by exploring the complementary role of hand-crafted features or by training the CNNs from scratch on the target dataset. The first solution is not the main focus of this paper. For the latter, since the training of such models needs very large databases along with special hardware requirements, the data-driven representation approach via full training of an ensemble of CNNs was chosen as the alternative strategy for this classification problem.

In the next baseline study, the performance of the single-scale CNNs was explored for the l0, l1, l2, and l3 input scales separately. For this purpose, four different CNNs were designed and optimized considering a minimum number of free parameters along with acceptable performance, as described in Subsection III-C1. As expected, CNN1, with 19 active layers and an l0-scale receptive field, performed better than the other evaluated single-scale CNNs. For this model, the


Fig. 8: Extracted feature maps of the C1 layers in the trained l3−l2−l1−l0 MCME model for an input ROI of the DME class.

overall precision and the training time were 97.50% and 0.086 sec/ROI, respectively. Compared to the evaluated MCME models, CNN1's time complexity is comparable to that of the l3−l2 MCME, while CNN1's performance is lower than this model's. This outcome demonstrates the efficiency of the selected ensemble approach versus comparable vanilla CNNs.

For comparison purposes, the common averaging ensemble of single CNNs (the Ave-ensemble) was also evaluated on dataset 1. This model, which averages the output maps of CNN1, CNN2, CNN3, and CNN4, yielded an AUC of 0.994 and an overall precision of 96.45%. It could not attain a performance comparable to the l3−l2−l1−l0 MCME model, although the training times of the two models were roughly the same (169 vs. 170 ms/ROI). In fact, in the Ave-ensemble, all the independent CNNs have the same combination weights and are treated equally. However, not all the scale-dependent CNNs are equally important. In contrast with the Ave-ensemble, the different CNNs in the MCME can be adaptively specialized to different parts of the feature space by means of the CGN's weighting role in the learning phase. So, making the final decision of the ensemble with these weighted outputs can effectively enhance the performance. On the other hand, as discussed in [44] and [59], the combined results of neural ensemble models are weakened, specifically if the errors of the individual nets are positively correlated. This phenomenon occurred here too, because for the subnetworks in the Ave-ensemble model the average correlation coefficient and κ factors were 0.31 and 0.77, respectively, while these values were −0.04 and 0.03 for the selected MCME model on dataset 1.

Furthermore, the FCME structure was assessed by fusing 2, 3, and 4 experts with l0 input scale. The best results were obtained using 3 experts, with an overall precision of 98.83% at λ = 0.1, where the training time for this model was about 0.193 sec/ROI. For this mixture ensemble of full-rank CNNs, using the new penalty term could effectively enhance the performance of the FCME model. Confirming this claim, the 3-expert FCME at λ = 0 (i.e., without the penalty term ρ) had a precision of 97.75%, while the average correlation coefficient, DbD, and κ factors for its CNN experts were 0.07, 0.45, and 0.43, respectively.

Based on the results in Table III, since the MCME model with the fusion of l1−l0 scales had the same performance as the l0−l0−l0 FCME but with a training time of 0.120 sec/ROI, the proposed prior MSSP decomposition in the MCME model (see Section II-C) is an impressive strategy for fast and discriminative information fusion. Based on the preliminary analyses on this dataset, the MCME results and complexity were improved in comparison to the best FCME model on the retinal OCT database. So, the proposed MCME has a greater potential to analyze the retinal VOIs than the best FCME method. With fewer parameters, less training time complexity (0.170 vs. 0.193 sec/ROI), and fewer memory requirements, the performance of the MCME was higher than that of the best FCME model. These outcomes are mainly valuable when encountering a medical diagnosis problem with a limited number of samples and an in-depth full-training procedure. Fig. 8 shows 2-D extracted feature maps of the C1 layers in the trained MCME model for an input ROI. The figure shows that each trained module in the MCME model generates different feature maps based on its specific input scale.

As demonstrated in Fig. 7, regarding the recall measure, τ = 15% is the best value for volumetric diagnosis at the patient level based on the abnormal B-scans predicted by the l3−l2−l1−l0 MCME on dataset 1. It should be noted that in some volumetric OCT acquisitions (those with 19 B-scans), the threshold of 15% leaves only 3 abnormal B-scans for

decision making. Furthermore, the model obtained a recall of V. C ONCLUSION


95.97% at τ = 1% on the dataset. The present study introduced and evaluated a novel CAD
system for retinal pathology diagnosis in macular OCT, which
B. MCME Analysis based on Dataset 2 does not rely on image denoising, full retinal layers segmen-
tation, or lesions detection processes. The main contribution
For dataset 2, the l2 −l1 −l0 MCME (with 9268 free param- of this study was to introduce and analyze the MCME model
eters) outperformed the l3 − l2 − l1 − l0 MCME at τ = 15% in retinal OCT classification problem. Therefore, the mathe-
where the precision and AUC values were 98.33% and 0.999, matical model of the novel classifier was introduced which
respectively for this model. Although the l3 − l2 − l1 − l0 is coupled with a new cost function based on the addition
MCME at λ = 0.7 and τ = 20% results in a precision of a cross-correlation penalty term. Data-driven features and
of 98.33%, the ratio of its free parameters (i.e., 10285) to the representative ability of the proposed model benefit to
the number of augmented training samples (10940 ROIs on reduce the complexity and diagnosis error and to obtain an
average in the CV5 method) makes it more prone to be over- overall average precision rate of 98.86% on two datasets of
fitted, therefore it presents a lower diagnostic sensitivity than 148 and 45 retinal OCT volumes including dry AMD, DME,
the l2 − l1 − l0 MCME model at τ = 15%. It should be and normal subjects. As the future work, it is expected that
noted that both the latter MCME models have a lower false with a larger database including more retinal pathologies with
positive diagnostic rate rather than the HOG-based classifica- a larger amount of different cases, and extended convolutional
tion method by Srinivasan et al. [22] on this database because modules, the performance of the proposed MCME model
they just misclassified a normal case as DME one in the CV5 should be significantly improved.
process. Compared to the sparse coding approach presented by
Sun et al. [27], although the classification performance of the
methods is similar on this dataset, the MCME approach does R EFERENCES
not rely on any denoising step in the preprocessing algorithm.
[1] T. Vos, R. M. Barber, B. Bell, A. Bertozzi-Villa, S. Biryukov, I. Bolliger,
This is where the performance of [27] algorithm is evaluated F. Charlson, A. Davis, L. Degenhardt, D. Dicker et al., “Global, regional,
by the leave-three-out method. and national incidence, prevalence, and years lived with disability for
Generally, based on Table III and Fig. 6, we observe that 301 acute and chronic diseases and injuries in 188 countries, 1990-2013:
a systematic analysis for the global burden of disease study 2013,” The
the adjusting parameter (λ) in the proposed cost function can Lancet, vol. 386, no. 9995, p. 743, 2015.
effectively impress the MCME performance. As pointed out [2] S. Mehta, “Age-related macular degeneration,” Primary Care: Clinics in
in [44], this phenomenon occurs due to the control ability of Office Practice, vol. 42, no. 3, pp. 377–391, 2015.
[3] M. M. Engelgau, L. S. Geiss, J. B. Saaddine, J. P. Boyle, S. M. Benjamin,
the bias-variance-covariance trade-off in the learning process E. W. Gregg, E. F. Tierney, N. Rios-Burrows, A. H. Mokdad, E. S. Ford
by λ. It seems that an optimum increasing diversity among et al., “The evolving diabetes burden in the United States,” Annals of
the scale-dependent experts by the λ leads to promising Internal Medicine, vol. 140, no. 11, pp. 945–950, 2004.
[4] S. Pershing, E. A. Enns, B. Matesic, D. K. Owens, and J. D. Goldhaber-
improvements in the performance of the MCME model with Fiebert, “Cost-effectiveness of treatment of diabetic macular edema,”
different scales fusion. Annals of Internal Medicine, vol. 160, no. 1, pp. 18–29, 2014.
Moreover, the performance of the proposed MCME models is strongly influenced by an intrinsic competition among the scale-dependent CNN experts during information fusion for input-pattern classification, in which the CGN rewards the winner of each competition with stronger, more specific error-feedback signals [54]. The CGN can thus partition the feature space according to the experts' performance, decrease the experts' correlations and κ, and consequently increase the overall diagnostic performance.
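This competitive behavior can be sketched in a few lines of Python following the classic mixture-of-experts formulation of Jacobs et al. [37], with a plain vector of gating scores standing in for the convolutional gating network (CGN); all names and the Gaussian-style likelihood are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_fusion(expert_outputs, gate_logits, target):
    """Gated fusion and per-expert responsibilities (after [37]).

    expert_outputs: (K, C) class-score vectors, one per expert.
    gate_logits:    (K,) raw gating scores; in MCME the CGN produces these.
    target:         (C,) one-hot label.
    """
    g = softmax(gate_logits)          # mixing coefficients
    y = g @ expert_outputs            # fused ensemble prediction
    # Posterior "responsibility" of each expert for this sample: experts
    # that fit the target best win a larger share of the error feedback,
    # which specializes them on that region of the feature space.
    lik = np.exp(-0.5 * np.sum((expert_outputs - target) ** 2, axis=1))
    h = g * lik / np.sum(g * lik)
    return y, h
```

During training, each expert's gradient is scaled by its responsibility h_i, so the winner of each competition indeed receives the strongest and most specific error signal.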
Besides, considering the time complexities reported in Table III, the MCME also offers an efficient full-training strategy together with a fast and powerful representation model for pathological OCT diagnosis.
Among several promising directions, one could extend the MCME approach to macular OCT image classification without the retinal alignment preprocessing, by using more complex CNN modules and larger OCT databases.
Finally, the experimental studies on two distinct datasets showed that, with an optimal number of scale-dependent convolutional experts and a near-optimal balance of the bias-variance-covariance trade-off provided by λ, the MCME structure with its in-depth, full-training approach yields an efficient, fast, and reliable diagnostic classifier for macular OCT screening. These two datasets comprised 148 and 45 retinal OCT volumes including dry AMD, DME, and normal subjects. As future work, it is expected that with a larger database covering more retinal pathologies and a greater number of cases per class, together with extended convolutional modules, the performance of the proposed MCME model can be improved significantly.

REFERENCES

[1] T. Vos, R. M. Barber, B. Bell, A. Bertozzi-Villa, S. Biryukov, I. Bolliger, F. Charlson, A. Davis, L. Degenhardt, D. Dicker et al., "Global, regional, and national incidence, prevalence, and years lived with disability for 301 acute and chronic diseases and injuries in 188 countries, 1990-2013: a systematic analysis for the global burden of disease study 2013," The Lancet, vol. 386, no. 9995, p. 743, 2015.
[2] S. Mehta, "Age-related macular degeneration," Primary Care: Clinics in Office Practice, vol. 42, no. 3, pp. 377–391, 2015.
[3] M. M. Engelgau, L. S. Geiss, J. B. Saaddine, J. P. Boyle, S. M. Benjamin, E. W. Gregg, E. F. Tierney, N. Rios-Burrows, A. H. Mokdad, E. S. Ford et al., "The evolving diabetes burden in the United States," Annals of Internal Medicine, vol. 140, no. 11, pp. 945–950, 2004.
[4] S. Pershing, E. A. Enns, B. Matesic, D. K. Owens, and J. D. Goldhaber-Fiebert, "Cost-effectiveness of treatment of diabetic macular edema," Annals of Internal Medicine, vol. 160, no. 1, pp. 18–29, 2014.
[5] R. Bernardes and J. Cunha-Vaz, Optical Coherence Tomography: A Clinical and Technical Update. Springer Science & Business Media, 2012.
[6] https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Physician-and-Other-Supplier.htm, [Online], 2016.
[7] M. E. van Velthoven, D. J. Faber, F. D. Verbraak, T. G. van Leeuwen, and M. D. de Smet, "Recent developments in optical coherence tomography for imaging the retina," Progress in Retinal and Eye Research, vol. 26, no. 1, pp. 57–77, 2007.
[8] H. Rabbani, M. Sonka, and M. D. Abramoff, "Optical coherence tomography noise reduction using anisotropic local bivariate gaussian mixture prior in 3D complex wavelet domain," Journal of Biomedical Imaging, vol. 2013, pp. 10–22, 2013.
[9] R. Kafieh, H. Rabbani, and I. Selesnick, "Three dimensional data-driven multi scale atomic representation of optical coherence tomography," IEEE Transactions on Medical Imaging, vol. 34, no. 5, pp. 1042–1062, 2015.
[10] Z. Amini and H. Rabbani, "Statistical modeling of retinal optical coherence tomography," IEEE Transactions on Medical Imaging, vol. 35, no. 6, pp. 1544–1554, 2016.
[11] R. Kafieh, H. Rabbani, M. D. Abramoff, and M. Sonka, "Curvature correction of retinal OCTs using graph-based geometry detection," Physics in Medicine and Biology, vol. 58, no. 9, pp. 2925–2938, 2013.
[12] M. D. Abràmoff, K. Lee, M. Niemeijer, W. L. Alward, E. C. Greenlee, M. K. Garvin, M. Sonka, and Y. H. Kwon, "Automated segmentation of the cup and rim from spectral domain OCT of the optic nerve head," Investigative Ophthalmology & Visual Science, vol. 50, no. 12, pp. 5778–5784, 2009.
[13] Q. Yang, C. A. Reisman, Z. Wang, Y. Fukuma, M. Hangai, N. Yoshimura, A. Tomidokoro, M. Araie, A. S. Raza, D. C. Hood et al., "Automated layer segmentation of macular OCT images using dual-scale gradient information," Optics Express, vol. 18, no. 20, pp. 21293–21307, 2010.
[14] G. Quellec, K. Lee, M. Dolejsi, M. K. Garvin, M. Abramoff, and M. Sonka, "Three-dimensional analysis of retinal layer texture: identification of fluid-filled regions in SD-OCT of the macula," IEEE Transactions on Medical Imaging, vol. 29, no. 6, pp. 1321–1330, 2010.
[15] R. Kafieh, H. Rabbani, M. Abramoff, and M. Sonka, "Intra-retinal layer segmentation of 3D optical coherence tomography using coarse grained diffusion map," Medical Image Analysis, vol. 17, no. 8, pp. 907–928, 2013.
[16] P. A. Dufour, L. Ceklic, H. Abdillahi, S. Schroder, S. De Dzanet, U. Wolf-Schnurrbusch, and J. Kowal, "Graph-based multi-surface segmentation of OCT data using trained hard and soft constraints," IEEE Transactions on Medical Imaging, vol. 32, no. 3, pp. 531–543, 2013.
[17] Y. Sun, T. Zhang, Y. Zhao, and Y. He, "3D automatic segmentation method for retinal optical coherence tomography volume data using boundary surface enhancement," Journal of Innovative Optical Health Sciences, vol. 9, no. 02, p. 1650008, 2016.
[18] M. Esmaeili, A. Dehnavi, and H. Rabbani, "3D curvelet-based segmentation and quantification of drusen in optical coherence tomography images," Journal of Electrical and Computer Engineering, vol. 2017, pp. 1–12, 2017.
[19] Y.-Y. Liu, M. Chen, H. Ishikawa, G. Wollstein, J. Schuman, and J. Rehg, "Automated macular pathology diagnosis in retinal OCT images using multi-scale spatial pyramid and local binary patterns in texture and shape encoding," Medical Image Analysis, vol. 15, no. 5, pp. 748–759, 2011.
[20] A. Albarrak, F. Coenen, and Y. Zheng, "Age-related macular degeneration identification in volumetric optical coherence tomography using decomposition and local feature extraction," in Proc. of 2013 Int. Conf. on Medical Image, Understanding and Analysis, 2013, pp. 59–64.
[21] S. Farsiu, S. J. Chiu, R. V. O'Connell, F. A. Folgar, E. Yuan, J. A. Izatt, and C. A. Toth, "Quantitative classification of eyes with and without intermediate age-related macular degeneration using optical coherence tomography," Ophthalmology, vol. 121, no. 1, pp. 162–172, 2014.
[22] P. Srinivasan, L. Kim, P. Mettu, S. Cousins, G. Comer, J. Izatt, and S. Farsiu, "Fully automated detection of diabetic macular edema and dry age-related macular degeneration from optical coherence tomography images," Biomedical Optics Express, vol. 5, no. 10, pp. 3568–3577, 2014.
[23] F. G. Venhuizen, B. van Ginneken, B. Bloemen, M. J. van Grinsven, R. Philipsen, C. Hoyng, T. Theelen, and C. I. Sánchez, "Automated age-related macular degeneration classification in OCT using unsupervised feature learning," in SPIE Medical Imaging, 2015, pp. 94141I–94141I.
[24] G. Lemaître, M. Rastgoo, J. Massich, C. Y. Cheung, T. Y. Wong, E. Lamoureux, D. Milea, F. Mériaudeau, and D. Sidibé, "Classification of SD-OCT volumes using local binary patterns: Experimental validation for DME detection," Journal of Ophthalmology, vol. 2016, 2016.
[25] S. Apostolopoulos, C. Ciller, S. I. De Zanet, S. Wolf, and R. Sznitman, "RetiNet: Automatic AMD identification in OCT volumetric data," ArXiv preprint arXiv:1610.03628, 2016.
[26] F. G. Venhuizen, B. van Ginneken, F. van Asten, M. J. van Grinsven, S. Fauser, C. B. Hoyng, T. Theelen, and C. I. Sánchez, "Automated staging of age-related macular degeneration using optical coherence tomography," Investigative Ophthalmology & Visual Science, vol. 58, no. 4, pp. 2318–2328, 2017.
[27] Y. Sun, S. Li, and Z. Sun, "Fully automated macular pathology detection in retina optical coherence tomography images using sparse coding and dictionary learning," Journal of Biomedical Optics, vol. 22, no. 1, pp. 016012–016012, 2017.
[28] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[29] Y. LeCun, K. Kavukcuoglu, and C. Farabet, "Convolutional networks and applications in vision," in Proceedings of 2010 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2010, pp. 253–256.
[30] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[31] R. Rasti, M. Teshnehlab, and R. Jafari, "A CAD system for identification and classification of breast cancer tumors in DCE-MR images based on hierarchical convolutional neural networks," Computational Intelligence in Electrical Engineering, vol. 6, no. 1, pp. 1–14, 2015.
[32] J. Kittler, M. Hatef, R. P. Duin, and J. Matas, "On combining classifiers," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226–239, 1998.
[33] S. Kotsiantis, "Combining bagging, boosting, rotation forest and random subspace methods," Artificial Intelligence Review, vol. 35, no. 3, pp. 223–240, 2011.
[34] S. B. Kotsiantis, "An incremental ensemble of classifiers," Artificial Intelligence Review, vol. 36, no. 4, pp. 249–266, 2011.
[35] P. Burt and E. Adelson, "The Laplacian pyramid as a compact image code," IEEE Transactions on Communications, vol. 31, no. 4, pp. 532–540, 1983.
[36] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, "Learning hierarchical features for scene labeling," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1915–1929, 2013.
[37] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, "Adaptive mixtures of local experts," Neural Computation, vol. 3, no. 1, pp. 79–87, 1991.
[38] M. H. Nguyen, H. A. Abbass, and R. I. Mckay, "A novel mixture of experts model based on cooperative coevolution," Neurocomputing, vol. 70, no. 1, pp. 155–163, 2006.
[39] R. Ebrahimpour, E. Kabir, H. Esteky, and M. R. Yousefi, "View-independent face recognition with mixture of experts," Neurocomputing, vol. 71, no. 4, pp. 1103–1107, 2008.
[40] M. Javadi, S. A. A. A. Arani, A. Sajedin, and R. Ebrahimpour, "Classification of ECG arrhythmia by a modular neural network based on mixture of experts and negatively correlated learning," Biomedical Signal Processing and Control, vol. 8, no. 3, pp. 289–296, 2013.
[41] M. N. Dailey and G. W. Cottrell, "Organization of face and object recognition in modular neural network models," Neural Networks, vol. 12, no. 7, pp. 1053–1074, 1999.
[42] S. Masoudnia and R. Ebrahimpour, "Mixture of experts: a literature survey," Artificial Intelligence Review, vol. 42, no. 2, pp. 275–293, 2014.
[43] R. A. Jacobs, "Bias/variance analyses of mixtures-of-experts architectures," Neural Computation, vol. 9, no. 2, pp. 369–383, 1997.
[44] Y. Liu and X. Yao, "Simultaneous training of negatively correlated neural networks in an ensemble," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 29, no. 6, pp. 716–725, 1999.
[45] G. Hinton, N. Srivastava, and K. Swersky, "Lecture 6a: overview of mini-batch gradient descent," Coursera lecture slides, https://class.coursera.org/neuralnets-2012-001/lecture, [Online], 2012.
[46] J. Cohen, "A coefficient of agreement for nominal scales," Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, 1960.
[47] C. O. Freitas, J. M. De Carvalho, J. Oliveira Jr, S. B. Aires, and R. Sabourin, "Confusion matrix disagreement for multiple classifiers," in Iberoamerican Congress on Pattern Recognition. Springer, 2007, pp. 387–396.
[48] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," ArXiv preprint arXiv:1409.1556, pp. 1–14, 2015.
[49] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[50] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[51] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning, 2015, pp. 448–456.
[52] P. Buyssens, A. Elmoataz, and O. Lézoray, "Multiscale convolutional neural networks for vision-based classification of cells," in Asian Conference on Computer Vision. Springer, 2012, pp. 342–352.
[53] H. Chen, Q. Dou, X. Wang, J. Qin, P.-A. Heng et al., "Mitosis detection in breast cancer histology images via deep cascaded networks," in Association for the Advancement of Artificial Intelligence, 2016, pp. 1160–1166.
[54] R. Rasti, M. Teshnehlab, and S. L. Phung, "Breast cancer diagnosis in DCE-MRI using mixture ensemble of convolutional neural networks," Pattern Recognition, vol. 72, pp. 381–390, 2017.
[55] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in AISTATS, vol. 9, 2010, pp. 249–256.
[56] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. J. Goodfellow, A. Bergeron, N. Bouchard, and Y. Bengio, "Theano: new features and speed improvements," Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
[57] F. Chollet, "Keras," https://github.com/fchollet/keras, 2015.
[58] H.-C. Shin, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D. Mollura, and R. M. Summers, "Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning," IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1285–1298, 2016.
[59] M. P. Perrone and L. N. Cooper, "When networks disagree: Ensemble methods for hybrid neural networks," Brown Univ Providence RI Inst for Brain and Neural Systems, Tech. Rep., 1992.