Enhancing Efficiency of Ejection Fraction Calculation in the Left Ventricle
Volume: 4 Issue: 1 69 – 73
_______________________________________________________________________________________________
Abstract—The calculation of the cardiac ejection fraction is important for determining whether or not a patient suffers from cardiovascular
disease. However, manual calculation of the ejection fraction (EF) is prone to errors and is known to be prohibitively time-consuming. As such,
there have been endeavors to automate this process for the sake of saving time as well as improving accuracy of estimation.
Recently, GPUs have been proposed to enhance the performance of machine learning algorithms that attempt to estimate the EF. In addition, these
algorithms are considered a necessary component in solving the computational efficiency issues encountered in dealing with huge Digital Imaging
and Communications in Medicine (DICOM) datasets.
In this study, we used a DICOM dataset of cardiac magnetic resonance imaging for 1200 human cases of different ages and genders to calculate
the ejection fraction in the left ventricle (LV). A Convolutional Neural Network (CNN) was the neural network selected for the training phase of
LV segmentation and volume calculation.
Our target is to enhance the efficiency of the CNN in order to speed up the training phase, and subsequently the prediction of the CVDs, by
experimenting with different GPU-based parallelism techniques, namely Data Parallelism (DP) and Model Parallelism (MP), in addition to the generic
use of multiple GPUs. Specifically, we performed four variants of experiments: the first used multiple GPUs without applying any control over their
behavior; the second and third used either DP alone or MP alone on multiple GPUs; and the fourth combined both DP and MP. This was done on
Amazon EC2 instances that support up to 8 GPUs per instance. We used two EC2 instances to run our experiments on 16 GPUs. Our experiments show
that the proposed combination of DP and MP has the best computational efficiency. Precisely, a speedup of up to 9.88 (over a single GPU) was
achieved when using 16 GPUs in parallel with combined DP and MP.
Keywords- Multiple GPUs; Data Parallelism; Model Parallelism; Convolutional Neural Network
__________________________________________________*****_________________________________________________
IJFRCSCE | January 2018, Available @ http://www.ijfrcsce.org
_______________________________________________________________________________________
International Journal on Future Revolution in Computer Science & Communication Engineering ISSN: 2454-4248
$$ EF\% = 100 \times \frac{EDV - ESV}{EDV} \qquad (1) $$
Equation (1), EF calculation.

where end diastolic volume (EDV) represents the volume of blood in the LV at the end of diastole, when the cardiac muscle is completely relaxed and the LV is maximally filled with blood, and end systolic volume (ESV) represents the volume of blood in the LV at the end of systole, when the cardiac muscle is maximally contracted and the LV has pumped the blood out [6].

B. Convolutional Neural Network (CNN)
A CNN is a type of feedforward neural network algorithm that has been shown to display high accuracy in medical image processing [1, 8, 18] and is characterized by high learning speed [4]. It begins with an input layer containing any number of images, followed by a number of hidden layers, each consisting of two main steps:

i. The first step is convolution, which applies a filter containing the required shape that we are searching for inside the image, with this filter being smaller than the image. For example, assuming the required shape is a horizontal line, this step passes the filter all over the image to find all the horizontal lines inside it; each match enters as an input image into the second step.
ii. The second step is subsampling, or max pooling, which is responsible for storing the resized images from the previous convolution step that contain the required object.

The number of filters used in the first step is equal to the number of convolutions. The two steps may be repeated several times according to the requirement. The output of the last hidden layer (containing the last subsampled images) is the input for the fully connected layers that represent our classifier. The output layer can consist of one or more items depending on the requirement. For example, if the requirement is discriminating between different diseases depending on the shape, the number of outputs depends on the number of diseases we train the network on [5, 8]. In our case, we need only one output image distinguishing the LV location.

C. Parallelism in Image Processing
Saxena et al. [17] presented a detailed survey of parallelism in image processing. The survey covered GPU usage: a GPU uses less power than a CPU despite having up to 240 cores, which is 30-60 times the number of cores in a server CPU, and its thread manager can support more than 10 thousand threads per core. GPUs can be managed programmatically using several high-level programming languages, but they cost more than other methods. CUDA (Compute Unified Device Architecture) libraries can also be managed programmatically using several high-level programming languages and integrate quickly with the GPU; their main limitation is that they integrate only with NVIDIA GPUs. Open Computing Language (OpenCL) performs worse than CUDA. Hadoop performs worse than Java due to its dependency on Matlab in image processing tasks, as Matlab is already built on Java. OpenCV is specific to image processing, with embedded functions that need extra work from the programmer to master. Java has no cost for exchanging parameters and communicating between threads, but errors in multithreaded programs are difficult to debug, competition for machine resources affects performance, and hardware support for multithreading also needs to be provided by the operating system to reach maximum efficiency.

D. Related Work
Avendi et al. [1] automated the segmentation of the LV and the calculation of the EF using the MICCAI 2009 challenge dataset with CNN and stacked autoencoder (AE) techniques. The reported accuracy was within ±1.96 SD for EDV, ESV and EF, about 2.4% from the manually calculated values. The time cost of the training phase was 3.4 hours for the CNN and 34.25 minutes for the autoencoder phase, for 1350 images x 45 groups; applying the model to a new image cost 0.25 seconds for the CNN, 0.002 seconds for the stacked AE and 0.2 seconds for segmentation. Margeta [10] segmented the LV using decision forests, which was less accurate than [1] because papillary muscles and trabeculations were included, causing over-segmentation. Zhen et al. [18] worked on the recognition and calculation of the EF in the LV and RV without segmenting each chamber separately in MRI images; the best correlation coefficient was 0.921 and the least LV estimation error was 0.010 ± 0.011, using a combination of CNN and deep belief nets for recognition.

Kim et al. [9] enhanced performance using multiple GPUs to train CNNs on different frameworks: Theano, Caffe, Torch, TensorFlow and CNTK. The maximum speedup was 2.6, using 4 GPUs on CNTK with 1bit-SGD.

II. OVERVIEW OF THE PROPOSED MODEL
We propose a fully automated CNN method to recognize the LV on a dataset that includes 1200 real cases of human MRI with 2D cine DICOM images. Each case has 13 different planes, and each plane includes 30 slices representing the cycle of a complete heartbeat while the patient is holding his/her breath. The dataset is from the Second Annual Data Science Bowl [12] and includes different genders and ages.

For the segmentation phase, 110 short-axis (SAX) images were labeled by hand. Segmentation then continued for all slices using the CNN, resulting in an image of the same size as the input image with pixels in the [0, 1] range, where 1 represents an LV pixel and 0 is outside the LV. Table 1 defines the CNN layers used in segmentation, where b = batch size, Conv = convolution, BN = batch normalization, ReLU(x) = max(0, x) and Sigmoid(x) = 1/(1 + exp(-x)). This CNN model was used by the winners of the Kaggle competition to train their neural network [12]. Z-score based normalization was applied as in (2) after each convolution layer [14].

TABLE 1. CNN LAYERS
Layer           Filters   Filter size / factor   Output shape
Input           -         -                      (b, 1, 246, 246)
Conv+BN+ReLU    8         7                      (b, 8, 240, 240)
Conv+BN+ReLU    16        3                      (b, 16, 238, 238)
MaxPool         -         2                      (b, 16, 119, 119)
Conv+BN+ReLU    32        3                      (b, 32, 117, 117)
MaxPool         -         2                      (b, 32, 58, 58)
Conv+BN+ReLU    64        3                      (b, 64, 56, 56)
MaxPool         -         2                      (b, 64, 28, 28)
Conv+BN+ReLU    64        3                      (b, 64, 26, 26)
Conv+BN+ReLU    64        3                      (b, 64, 28, 28)
Upscale         -         2                      (b, 64, 56, 56)
Conv+BN+ReLU    64        3                      (b, 64, 58, 58)
Upscale         -         2                      (b, 64, 116, 116)
Conv+BN+ReLU    32        7                      (b, 32, 122, 122)
Upscale         -         2                      (b, 32, 244, 244)
Conv+BN+ReLU    16        3                      (b, 16, 246, 246)
Conv+BN+ReLU    8         7                      (b, 8, 240, 240)
Conv+sigmoid    1         7                      (b, 1, 246, 246)

$$ z = \frac{x - \mu}{\sigma} \qquad (2) $$
Equation (2), Z score formula.

Figure 1 shows a block diagram illustrating the training phase, which targets segmenting the region of interest (ROI). The first step is augmenting the DICOMs by rotation, transposition and scaling, followed by dividing the input batches for processing through the PCIe switch, which sends the batches to the GPUs. The output of the classifier is then the input for the EF calculation phase. Equation (3) is applied to the DICOM slices to get the volume of the LV at the start of the heartbeat, when the heart is maximally contracted, representing the ESV, and at the end of the beat, when the heart is fully relaxed, representing the EDV. Both volumes are used in (1) to compute the EF [4]. Figure 2 shows a sample of DICOM slices from the dataset used, before and after segmentation.

Fig. 1, Training Phase Block.

$$ V_t = \sum_{s=1}^{N-1} \frac{A_{s,t} + A_{s+1,t}}{2} \left( L_{s+1} - L_s \right) + A_{1,t} \cdot \frac{h}{2} + A_{N,t} \cdot \frac{h}{2} \qquad (3) $$
Equation (3), LV volume detection from DICOM. A_{s,t} represents the area of slice s at time t, L_s is the slice location (so L_{s+1} - L_s is the spacing between adjacent slices), N is the number of slices and h is the slice thickness.

Fig. 2, Image represents 3 slices of the same case, with the LV segmentation result in the second row shown in blue.

The platform used is Amazon EC2 with NVIDIA Tesla V100 GPUs, on Ubuntu 16 OS, 128 GiB, CUDA 8, NVLink 300 GBps, network bandwidth 25 Gbps, RAM 488 GiB and a PCI-Express (PCIe) fabric switch. The maximum number of GPUs on an EC2 instance is 8, so when repeating the experiment steps with 16 GPUs, we use 2 EC2 instances working together in parallel.

The first experiment uses a single GPU and documents the time and accuracy. This is followed by using the different parallelism techniques and comparing the speedup at each phase of the experiment.

Parallelism in our experiment is implemented as follows:
i. Generic form of multiple GPUs, without applying any control from the programming side on the GPUs or on the data batches passed to the GPUs.
ii. Data parallelism (DP), represented by passing separate minibatches over multiple GPUs. Figure 3 illustrates how batches are passed separately when using 2 GPUs: each GPU works on its part of the data, then the gradient exchange is done through the PCIe, which carries the responsibility of summing up the gradients before updating the weights over both GPUs. There is no direct exchange of weights between the GPUs, which is expected to be the reason for the delay in this technique. The DP steps are repeated using 4, 8 and 16 GPUs. Data batches are divided over the number of GPUs, then the PCIe updates the weights for all. Each GPU calls the whole CNN model over its batch.
iii. Model parallelism (MP), represented by using multiple GPUs with the implemented CNN divided between nodes; the same batch passes from one node to the next in sequence. The nodes exchange the weights directly with each other using NVLink, without returning to the PCIe. When using 4 GPUs, the CNN layers were distributed between the 4 nodes, which may be the cause of some delay due to batch transfer from node to node. The experiment was repeated with the CNN distributed between the nodes.
iv. Combining both DP and MP techniques in the same operation. This model was first run using 4 GPUs: each 2 nodes are considered an MP unit in which the CNN model is divided over both nodes, the unit is cloned on the remaining 2 nodes, and the batches are divided between both MP units, which represents the DP part of the technique. The process was repeated using 2 MP units, each consisting of 4 GPUs. In the last run, as each EC2 instance has a maximum of 8 GPUs, we cloned the MP model on each instance, then divided the batches between the 2 instances.

III. EXPERIMENTAL RESULTS
We applied the proposed model for training, and calculated its efficiency, on the DICOM dataset from [12], which includes 1200 cases of human cardiac MRI, each with 13 different planes and 30 slices per plane. The cases cover humans of different ages and genders, and are used to predict the probability of CVDs
by calculating the EF in the LV. Table 2 demonstrates the speedup results based on (4), which compares the execution time using a single GPU (T_s) to the time taken by multiple GPUs (T_p) in every step of the experiment. The speedup calculations are represented in the chart in Figure 4, which shows higher efficiency for the generic model of GPUs than for DP or MP alone, with the DP model having the least effect on efficiency.

$$ S_p = \frac{T_s}{T_p} \qquad (4) $$
Equation (4), Speedup.

[Figure 4: chart of speed-up (vertical axis) for 16, 8, 4 and 2 GPUs. Data labels recovered from the chart, in extraction order: 9.875, 7.182, 4.938, 4.647, 3.435, 3.038, 2.821, 2.633, 2.633, 1.881, 1.549, 1.411, 1.179, 1.037.]
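As a quick check of Equation (4), the speedup can be computed with a few lines of Python. This is only a minimal sketch: the function names and the sample timings are hypothetical, and the per-GPU efficiency S_p / p is the standard textbook definition rather than a quantity reported in this study. The only figure taken from the paper is the best reported speedup of 9.88 on 16 GPUs.

```python
def speedup(t_single: float, t_parallel: float) -> float:
    """Equation (4): S_p = T_s / T_p."""
    if t_single <= 0 or t_parallel <= 0:
        raise ValueError("execution times must be positive")
    return t_single / t_parallel


def parallel_efficiency(s_p: float, n_gpus: int) -> float:
    """Speedup per GPU, S_p / p (standard definition, an assumption here)."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    return s_p / n_gpus


if __name__ == "__main__":
    # Hypothetical timings in seconds, chosen only to exercise the formula.
    print(speedup(100.0, 25.0))  # 4.0
    # Best case reported in the paper: speedup 9.88 on 16 GPUs (combined DP and MP).
    print(round(parallel_efficiency(9.88, 16), 4))  # 0.6175
```

Under this definition, the reported 9.88 speedup on 16 GPUs corresponds to roughly 62% of ideal linear scaling.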
The delay in the DP model is expected to be caused by its dependency on the PCIe and the CPU to propagate the gradients, whereas letting the GPUs send the results directly to each other over NVLink accelerates the training phase in the MP and combined forms. The delay in MP alone is expected to be due to the single thread used for all dataset batches. With the combined form, a major breakthrough was achieved, as the data was divided between the MP units.
DP and MP alone were shown to be less effective in improving computational efficiency than the generic form, while combining them turned out to be even more effective. The two techniques fortunately complement one another with minimal conflict. Using the generic form of GPUs, without interfering in their behavior with any technique, proved to have a better effect than using DP or MP alone.

Using multiple GPUs enhances the training performance without affecting the accuracy, in an approximately linear relation with the number of GPUs; however, using different types of GPUs causes some truncation/round-off errors.
Additional work will be required to compare CPUs with GPUs, keeping the same efficiency and speedup results in mind. The use of CPUs is expected to avoid truncation/round-off errors, but the cost of keeping the same promising speedup remains to be calculated.

REFERENCES
[1] Avendi, M. R., A. Kheradvar, and Hamid Jafarkhani. "A Combined Deep-Learning and Deformable-Model Approach to Fully Automatic Segmentation of the Left Ventricle in Cardiac MRI." arXiv preprint arXiv:1512.07951 (2015).
[2] Biering-Sorensen T, Shah S, Ananda I, Sweitzer N, Claggett
B, Pitt B, Pfeffer M, Solomon S, Shah A. PROGNOSTIC
IMPORTANCE OF LEFT VENTRICULAR MECHANICAL
DYSSYNCHRONY IN HEART FAILURE WITH
PRESERVED EJECTION FRACTION. Journal of the
American College of Cardiology. 2016 Apr 5;67(13_S):1484-.
[3] Canciello G, de Simone G, Izzo R, Giamundo A, Pacelli F,
Mancusi C, Galderisi M, Trimarco B, Losi MA. Validation of
left atrial volume estimation by left atrial diameter from the
parasternal long-axis view. Journal of the American Society
of Echocardiography. 2017 Mar 31;30(3):262-9.
[4] Glauner PO. Comparison of Training Methods for Deep
Neural Networks. arXiv preprint arXiv:1504.06825. 2015 Apr
26.
[5] Gu J, Wang Z, Kuen J, Ma L, Shahroudy A, Shuai B, Liu T,
Wang X, Wang G. Recent Advances in Convolutional Neural
Networks. arXiv preprint arXiv:1512.07108. 2015 Dec 22.
[6] Hall JE. Guyton and Hall textbook of medical physiology.
Elsevier Health Sciences; 2015 May 18.
[7] Herring W. Learning radiology: Recognizing the basics.
Elsevier Health Sciences; 2007 Jun 20.
[8] Highlander T. Efficient Training of Small Kernel
Convolutional Neural Networks using Fast Fourier
Transform (Doctoral dissertation, Wright State University),
2015.
[9] Kim H, Nam H, Jung W, Lee J. Performance analysis of CNN
frameworks for GPUs. In Performance Analysis of Systems
and Software (ISPASS), 2017 IEEE International Symposium
on 2017 Apr 24 (pp. 55-64). IEEE.
[10] Margeta, J., 2015. Machine Learning for Simplifying the Use
of Cardiac Image Databases (Doctoral dissertation, MINES
ParisTech).
[11] Mendis S, Puska P, Norrving B. Global atlas on
cardiovascular disease prevention and control. World Health
Organization; 2011.
[12] NHLBI, Data Science Bowl Cardiac Challenge Data,
December 14, 2015. https://www.kaggle.com/c/second-annual-data-science-bowl/data
[13] Nichols M, Townsend N, Scarborough P, Rayner M.
Cardiovascular disease in Europe 2014: epidemiological
update. European heart journal. 2014 Aug 12:ehu299.
[14] Patro S, Sahu KK. Normalization: A preprocessing stage.
arXiv preprint arXiv:1503.06462. 2015 Mar 19.
[15] Poudel RP, Lamata P, Montana G. Recurrent fully
convolutional neural networks for multi-slice MRI cardiac
segmentation. In International Workshop on Reconstruction
and Analysis of Moving Body Organs 2016 Oct 17 (pp. 83-94).
Springer, Cham.
[16] Radau P, Lu Y, Connelly K, Paul G, Dick AJ, Wright GA.
"Evaluation Framework for Algorithms Segmenting Short
Axis Cardiac MRI." The MIDAS Journal – Cardiac MR Left
Ventricle Segmentation Challenge, http://hdl.handle.net/10380/3070
[17] Saxena S, Sharma S, Sharma N. Parallel Image Processing
Techniques, Benefits and Limitations. Research Journal of
Applied Sciences, Engineering and Technology. 2016 Jan
20;12(2):223-38.
[18] Zhen X, Wang Z, Islam A, Bhaduri M, Chan I, Li S. Multi-
scale deep networks and regression forests for direct bi-
ventricular volume estimation. Medical Image Analysis. 2015
Jul 26