Enhancing Efficiency of Ejection Fraction Calculation in the Left Ventricle
Volume: 4 Issue: 1 69 – 73
Abstract—The calculation of the cardiac ejection fraction is important for determining whether or not a patient suffers from cardiovascular
disease. However, manual calculation of the ejection fraction (EF) is prone to errors and is known to be prohibitively time-consuming. As such,
there have been endeavors to automate this process for the sake of saving time as well as improving accuracy of estimation.
Recently, GPUs have been proposed to enhance the performance of machine learning algorithms that attempt to estimate the EF. In addition, these
algorithms are considered a necessary component in solving computational efficiency issues encountered in dealing with huge Digital Imaging
and Communications in Medicine (DICOM) datasets.
In this study, we used a DICOM dataset of cardiac magnetic resonance imaging for 1200 human cases of different ages and genders to calculate
the ejection fraction in the left ventricle (LV). A Convolutional Neural Network (CNN) was the selected neural network for the training phase of
segmenting the LV and volume calculation.
Our target is to enhance the efficiency of the CNN in order to speed up the training phase, and subsequently the prediction of cardiovascular
diseases (CVDs), by experimenting with different GPU-based parallelism techniques, namely Data Parallelism (DP) and Model Parallelism (MP), in
addition to the generic use of multiple GPUs. Specifically, we performed four variants of experiments: the first used multiple GPUs without
applying any control over their behavior; the second and third used either DP alone or MP alone on multiple GPUs; and the fourth combined both
DP and MP. The experiments were run on Amazon EC2 instances that support up to 8 GPUs per instance; we used two EC2 instances to run our
experiments on 16 GPUs. Our experiments show that the proposed combination of DP and MP has the best computational
efficiency. Specifically, a speedup of up to 9.88 (over a single GPU) was achieved when using 16 GPUs in parallel with combined DP and MP.
Keywords- Multiple GPUs; Data Parallelism; Model Parallelism; Convolutional Neural Network
$$\mathrm{EF}\% = 100 \times \frac{\mathrm{EDV} - \mathrm{ESV}}{\mathrm{EDV}} \qquad (1)$$
Equation (1), EF calculation.

where end diastolic volume (EDV) represents the volume of blood in the LV at the end of diastole, when the cardiac muscle is completely relaxed and the LV is maximally filled with blood, and end systolic volume (ESV) represents the volume of blood in the LV at the end of systole, when the cardiac muscle is maximally contracted and the LV has pumped the blood out [6].

B. Convolutional Neural Network (CNN)

CNN is a type of feedforward neural network algorithm that has been shown to display high accuracy in medical image processing [1,8,18] and is characterized by high learning speed [4]. It begins with an input layer containing any number of images, followed by a number of hidden layers, each consisting of two main steps:

i. The first step is convolution, which applies a filter containing the required shape that we are searching for inside the image, with this filter being smaller than the image. For example, assuming the required shape is a horizontal line, this step passes the filter all over the image to find all the horizontal lines inside it; each match enters as an input image into the second step.
ii. The second step is subsampling or Maxpool, which is responsible for storing the resized images from the previous convolution step containing the required shape.

The number of filters used in the first step is equal to the number of convolutions. The two steps may be repeated several times according to the requirement. The output of the last hidden layer (containing the last subsampled images) is the input for the fully connected layers that represent our classifier. The output layer can consist of one or more items depending on the requirement. For example, if the requirement is discriminating between different diseases depending on the shape, the number of outputs depends on the number of diseases we train the network on [5, 8]. In our case, we need only one output image distinguishing the LV location.

C. Parallelism in Image Processing

Saxena et al. [17] presented a detailed survey of parallelism in image processing. The survey covered GPU usage: GPUs use less power than CPUs despite having up to 240 cores, which is 30-60 times the number of cores in server CPUs, and include a thread manager that can support more than 10 thousand threads per core. They can be managed programmatically using several high-level programming languages, but cost more than other methods. CUDA (Compute Unified Device Architecture) libraries can be managed programmatically using several high-level programming languages and integrate quickly with the GPU; their main limitation is that they integrate only with NVIDIA hardware. Open Computing Language (OpenCL) performs worse than CUDA. Hadoop performs worse than Java because of its dependency on Matlab in image processing tasks, as Matlab is itself built on Java. OpenCV is specific to image processing, with embedded functions that need extra work from the programmer to master. Java has no cost for exchanging parameters and communicating between threads, but errors in multithreaded programs are difficult to debug, competition for machine resources affects performance, and hardware support for multithreading must also be provided by the operating system to reach maximum efficiency.

D. Related Work

Avendi et al. [2] automated the segmentation of the LV and the calculation of EF using the MICCAI 2009 challenge dataset with CNN and stacked autoencoder techniques. The reported accuracy was within ±1.96 SD for EDV, ESV and EF, about 2.4% from the manually calculated values. The training phase took 3.4 hours for the CNN and 34.25 minutes for the autoencoder phase on 1350 images x 45 groups; applying the model to a new image took 0.25 seconds for the CNN, 0.002 seconds for the stacked AE and 0.2 seconds for segmentation. Margeta [10] segmented the LV using decision forests, which was less accurate than [1] because it included papillary muscles and trabeculations that caused over-segmentation. Zhen et al. [18] worked on recognition and calculation of EF in the LV and RV without segmenting each chamber separately in MRI images; the best correlation coefficient was 0.921 and the lowest LV estimation error was 0.010 ± 0.011, using a combination of CNN and deep belief nets for recognition.

Kim et al. [9] enhanced performance using multiple GPUs to train a CNN on different frameworks (Theano, Caffe, Torch, TensorFlow and CNTK). The maximum speedup was 2.6 using 4 GPUs on CNTK with 1bit-SGD.

We propose a fully automated CNN method to recognize the LV on a dataset that includes 1200 real cases of human MRI with 2D cine DICOM images. Each case has 13 different planes, and each plane includes 30 slices representing a cycle of a complete heartbeat while the patient is holding his/her breath. The dataset is from the Second Annual Data Science Bowl [12] and includes different genders and ages.

For the segmentation phase, 110 SAX images were labeled by hand. Segmentation continued for all slices using the CNN, resulting in an image of the same size as the input image with pixel values in the range [0,1], where 1 represents an LV pixel and 0 is outside the LV. Table 1 defines the CNN layers used in segmentation, where b = batch size, Conv = convolution, BN = batch normalization, ReLU(x) = max(0, x) and Sigmoid(x) = 1/(1+exp(-x)). This CNN model was used by the winners of the Kaggle competition to train the neural network [12]. Z-score based normalization was applied as in (2) after each convolution layer [14].

Layer            Factor   Filter Size   Output Shape
Input                                   (b, 1, 246, 246)
Conv+BN+ReLU     8        7             (b, 8, 240, 240)
Conv+BN+ReLU     16       3             (b, 16, 238, 238)
MaxPool          2                      (b, 16, 119, 119)
Conv+BN+ReLU     32       3             (b, 32, 117, 117)
MaxPool          2                      (b, 32, 58, 58)
Conv+BN+ReLU     64       3             (b, 64, 56, 56)
MaxPool          2                      (b, 64, 28, 28)
Conv+BN+ReLU     64       3             (b, 64, 26, 26)
Conv+BN+ReLU     64       3             (b, 64, 28, 28)
Upscale          2                      (b, 64, 56, 56)
Conv+BN+ReLU     64       3             (b, 64, 58, 58)
Upscale          2                      (b, 64, 116, 116)
Conv+BN+ReLU     32       7             (b, 32, 122, 122)
Upscale          2                      (b, 32, 244, 244)
Conv+BN+ReLU     16       3             (b, 16, 246, 246)
Conv+BN+ReLU     8        7             (b, 8, 240, 240)
Conv+sigmoid     1        7             (b, 1, 246, 246)

$$z = \frac{x - \mu}{\sigma} \qquad (2)$$
Equation (2), Z score formula

Figure 1 shows a block diagram illustrating the training phase, which targets segmenting the region of interest (ROI). The first step is augmenting the DICOMs by rotation, transposition and scaling, followed by dividing the input batches for processing through the PCIe switch that sends the batches to the GPUs. The classifier output is then the input for the EF calculation phase. Equation (3) is applied on DICOM slices to get the volume of the LV at the start of the heartbeat, when the heart is maximally contracted, representing ESV, and at the end of the beat, when the heart is fully relaxed, representing EDV. Both volumes are used in (1) to compute the EF [4]. Figure 2 represents a sample of DICOM slices from the used dataset before and after segmentation.

Fig. 1, Training Phase Block.

$$V_t = \sum_{s=1}^{N-1} \frac{A_{s,t} + A_{s+1,t}}{2} \left(L_{s+1} - L_s\right) + A_{1,t} \cdot \frac{h}{2} + A_{N,t} \cdot \frac{h}{2} \qquad (3)$$
Equation (3), LV volume detection from DICOM. A_{s,t} represents the area of slice s at time t, L_s is the slice location, N is the number of slices and h is the slice thickness.

Fig. 2, Image represents 3 slices of the same case, with the LV segmentation result in the second row shown in blue.

The platform used is Amazon EC2 with NVIDIA Tesla V100 GPUs, on Ubuntu 16 OS, with 128 GiB of GPU memory, CUDA 8, NVLink at 300 GBps, network bandwidth of 25 Gbps, 488 GiB of RAM and a PCI-Express (PCIe) fabric switch. The maximum number of GPUs on an EC2 instance is 8, so when repeating the experiment steps with 16 GPUs, we use 2 EC2 instances working together in parallel.

The first experiment uses a single GPU and documents the time and accuracy. This is followed by using the different models of parallelism techniques and comparing the speedup at each phase of the experiment.

Parallelism in our experiment is implemented as follows:

i. Generic form of multiple GPUs, without applying any control from the programming side on the GPUs or the data batches passed to the GPUs.
ii. Data parallelism (DP), represented by passing separate minibatches over multiple GPUs. Figure 3 illustrates how batches are passed separately when using 2 GPUs: each GPU works on its part of the data, then the gradient exchange is done through the PCIe, which carries the responsibility of summing up the gradients before updating the weights on both GPUs. There is no direct exchange of weights between GPUs, which is expected to be the reason for the delay in this technique. The DP steps are repeated using 4, 8 and 16 GPUs. Data batches are divided over the number of GPUs, then the PCIe updates the weights for all. Each GPU calls the whole CNN model over its batch.
iii. MP, represented by using multiple GPUs with the CNN implementation divided between nodes; the same batch passes from one node to the next in sequence. The GPUs exchange the weights directly with each other using NVLink without returning to the PCIe. When using 4 GPUs, the CNN layers were distributed between the 4 nodes, which may be the cause of some delay due to batch transfer from node to node. The experiment was repeated with distributing the CNN across nodes.
iv. Combining both DP and MP techniques in the same operation. This model was first run using 4 GPUs: each 2 nodes are considered an MP unit in which the CNN model is divided across both nodes, the unit is cloned on the remaining 2 nodes, and batches are divided between both MP units, which represents the DP part of the technique. The process was repeated using 2 MP units, each consisting of 4 GPUs. In the last run, as each EC2 instance has a maximum of 8 GPUs, we cloned the MP model on each instance, then divided the batches between the 2 instances.

III. EXPERIMENTAL RESULTS

We applied the proposed model to training and measured efficiency on the DICOM dataset from [12], which includes 1200 cases of human cardiac MRI from subjects of different ages and genders; each case includes 13 different planes with 30 slices per plane. The aim is to predict the probability of CVDs by calculating the EF in the LV. Table 2 demonstrates the speedup results based on (4), which compares the execution time using a single GPU (T_s) to the time taken by multiple GPUs (T_p) in every step of the experiment. The speedup calculations are represented in the Figure 4 chart, showing the highest efficiency when using the generic model of GPUs, while the DP model has the least effect on efficiency.
$$S_p = \frac{T_s}{T_p} \qquad (4)$$
Equation (4), Speedup.
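
As a minimal illustration of Equation (4), the sketch below computes the speedup from two measured execution times. The timings are hypothetical values chosen only so that the result matches the 9.88 speedup on 16 GPUs reported in the abstract; they are not measurements from this study. The per-GPU efficiency helper is an additional, commonly used ratio assumed here for illustration and is not a metric defined in the paper.

```python
def speedup(t_single: float, t_parallel: float) -> float:
    """Speedup S_p = T_s / T_p, following Equation (4)."""
    if t_single <= 0 or t_parallel <= 0:
        raise ValueError("execution times must be positive")
    return t_single / t_parallel


def parallel_efficiency(s_p: float, n_gpus: int) -> float:
    """Speedup divided by GPU count (assumed helper, not from the paper).

    A value of 1.0 would correspond to perfectly linear scaling.
    """
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    return s_p / n_gpus


# Hypothetical timings in seconds, for illustration only.
t_s = 988.0   # single-GPU training time (assumed)
t_p = 100.0   # 16-GPU training time (assumed)

s_p = speedup(t_s, t_p)
eff = parallel_efficiency(s_p, 16)
print(f"speedup = {s_p:.2f}, per-GPU efficiency = {eff:.2f}")
```

Reading the two numbers together shows how far a configuration is from linear scaling: a speedup of 9.88 on 16 GPUs corresponds to roughly 0.62 of the ideal.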
The delay in the DP model is expected to be caused by the dependency on the PCIe and CPU to propagate the gradients, whereas letting the GPUs send the results directly to each other using NVLink accelerates the training phase in the MP and combined forms. The delay in MP alone is expected to be due to the single thread used for all dataset batches. In the combined form, a major breakthrough was achieved because the data was divided between the MP units.

DP and MP were shown to be less effective in improving computational efficiency than the generic form, while combining them turned out to be even more effective. The two techniques fortunately complement one another with

Using multiple GPUs enhances the training performance without affecting the accuracy, in an approximately linear relation with the number of GPUs; however, using different types of GPUs causes some truncation/round-off errors.

Additional work will be required to compare CPUs with GPUs, keeping the same efficiency and speedup results in mind. The use of CPUs is expected to avoid truncation/round-off errors, but the cost of keeping the same promising speedup will need to be calculated.
