

International Journal of Computer and Information Technology (ISSN: 2279 – 0764)
Volume 07 – Issue 02, March 2018

Comparative Performance of Several Recent Supervised Learning Algorithms

Tony Bazzi*
Department of Electrical and Systems Engineering
Oakland University
Rochester, MI 48309, USA
*Email: tbazzi [AT] oakland.edu

Rana Ismail
Optimal Medical Center
Dearborn, MI 48124, USA

Mohamed Zohdy
Department of Electrical and Systems Engineering
Oakland University
Rochester, MI 48309, USA

Abstract— A wide variety of optimization algorithms have been developed; however, their performance across optimization landscapes is still unclear. The manuscript presented herein discusses methods for modeling and training neural networks on a small dataset. The algorithms include conventional gradient descent, Levenberg-Marquardt, Momentum, Nesterov Momentum, ADAgrad, and RMSprop learning methodologies. The work aims to compare the performance, efficiency, and accuracy of the different algorithms utilizing the fertility dataset available through the UC Irvine machine learning repository.

Keywords: Neural Networks, Back Propagation, Levenberg-Marquardt, Momentum, Nesterov, ADAgrad, RMSprop, Hyperparameters, Newton Method, Supervised Learning, gradient descent, delta rule, Python, Fertility

I. INTRODUCTION
Background
An artificial neural network is a supervised machine learning algorithm used by computers to model complex high-dimensional datasets and make predictions without being explicitly programmed. Neural networks are biologically inspired, mimicking the operation of neurons in the human brain. The first neural network was modeled in 1957 by Frank Rosenblatt and was referred to as the perceptron. Lately, neural networks have been gaining popularity through deep learning, such as recurrent and convolutional networks for speech recognition and autonomous applications. Several back-propagation learning algorithms were developed to improve the convergence speed and performance of these networks. Algorithms such as momentum back-propagation have been widely used in neurocomputing [1][2], and several adaptive methodologies were later produced to dynamically vary the momentum parameter, as described in [3][4][5][6][7]. Nesterov's momentum, developed by Yuri Nesterov in 1983, is one of the adaptive algorithms that has gained popularity [8]. Levenberg's algorithm, published in 1944, is another back-propagation approach that combines the Gauss-Newton algorithm (GNA) with conventional gradient descent. Levenberg's contribution modifies the GNA to include a hyperparameter µ allowing larger learning steps in the initial phases of training and smaller ones as the solution tends towards convergence, depending on the rate of change of the cost function s(w) [9]. On the other hand, Marquardt's extension in 1963 suggests scaling each component of the gradient according to the curvature, so that there is larger movement where the rate of change of the cost function is smaller [10]. In 2012, an adaptive gradient method, or ADAgrad, that incorporates knowledge of the geometry of the observed data was proposed by Duchi et al. [11]. Root mean square propagation, or RMSprop, is a modified version of ADAgrad that attempts to reduce its aggressive, monotonically decreasing learning rate. The latter method was presented by Geoff Hinton in his Coursera class. Although RMSprop is an unpublished training algorithm, it has been adopted by the neurocomputing community, hence its addition to the scope of work in this manuscript. Numerous training algorithms have been presented, discussed, and utilized both in artificial neural networks with one hidden layer and in deep learning networks such as recurrent and convolutional networks. Comparisons of neural network back-propagation algorithms have been carried out in the literature for specific case studies such as stream flow forecasting, determination of lateral stress in cohesionless soils, electricity load forecasting, radio network planning tools, power quality monitoring, software defect prediction, sleep disorders, and electrostatic precipitation for air quality control [12][13][14][15][16][17][18]. However, none of them compare and study the effectiveness of the algorithms covered in our work, especially when modeling complexities in small, high-dimensional datasets. To bridge the gap, we employ the fertility dataset from the UC Irvine machine learning repository to compare the performance and effectiveness of conventional gradient descent, Levenberg-Marquardt, Momentum, Nesterov Momentum, ADAgrad, and RMSprop learning methods by studying the mean square error propagation versus the number of iterations and by visually comparing actual versus predicted observations for our labeled dataset.

The fertility dataset involves analyzing semen samples from 100 volunteers per the 2010 World Health Organization criteria, along with attributes such as socio-demographic data, environmental factors, health status, and life habits. The neural network, along with the different training algorithms presented in our scope of work, is programmed in Python version 2.7 using the Spyder 2.3.8 development environment and the "numpy" fundamental package for scientific computing. It is important to mention that all the algorithms and the neural network were developed and programmed without using generic Python libraries or APIs such as SciPy and Keras.

II. THEORY
Neural Network
The end design of the neural network is illustrated in figure 1. The network receives an input matrix with 9 features and 100 data points, hence our small-dataset criterion. The hidden layer includes 9 nodes while the output layer includes a single node. The hidden nodes are activated with hyperbolic functions while the output node, or fertility predictor, uses a pure linear activation function. The choice of a single hidden layer is a result of the smaller number of attributes. Nonetheless, Hornik [19] showed that, given enough hidden units, a single-layer feedforward network can approximate any continuous function, as described in [18].
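For concreteness, the following is a minimal numpy sketch of a forward pass through the 9-9-1 architecture described above. It is our own illustration, not the authors' code: the assumption that the hyperbolic activation is tanh, the inclusion of bias terms, and the weight initialization are all illustrative choices not specified in the text.

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    """Forward pass: hyperbolic (tanh) hidden layer, pure linear output node."""
    hidden = np.tanh(X.dot(W1) + b1)   # 9 hidden nodes
    output = hidden.dot(W2) + b2       # single linear output (fertility predictor)
    return hidden, output

# Illustrative shapes for the fertility data: 100 samples x 9 features.
rng = np.random.RandomState(0)
X = rng.rand(100, 9)
W1, b1 = 0.1 * rng.randn(9, 9), np.zeros(9)   # input-to-hidden weights and biases
W2, b2 = 0.1 * rng.randn(9, 1), np.zeros(1)   # hidden-to-output weights and bias
hidden, y_hat = forward(X, W1, b1, W2, b2)    # y_hat has shape (100, 1)
```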

Figure 1: Neural interpretation diagram for the network used.

Training Algorithms
The process of tuning the neural network parameters, or updating the weights to emulate the characteristics of an input-output pattern, is performed by the following back-propagation algorithms.

Gradient Descent
The gradient descent back-propagation training algorithm is commonly referred to as the MIT or delta rule. Suppose we have a cost function J(\theta) where \theta denotes the weight components; the delta rule performs a gradient search while updating and tuning the parameters to minimize the cost function. The gradient descent rule is described mathematically per the following formula:

\Delta\theta_t = -\eta \left( \frac{\partial J(\theta)}{\partial \theta} \right)^T
\theta_{new} = \theta_{old} + \Delta\theta_t    (1)

where
\eta = learning rate
\theta = weights
J = cost function
t = number of iterations
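As an illustrative sketch only (the paper's own implementation is not published), Eq. (1) reduces to the familiar delta-rule update below. The function name and the default learning rate of 0.4, which is the value reported in the Results section, are our own choices.

```python
import numpy as np

def gradient_descent_step(theta, grad, eta=0.4):
    """Eq. (1): theta_new = theta_old - eta * dJ/dtheta."""
    return theta - eta * np.asarray(grad)
```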
Adaptive Momentum Back-Propagation
The momentum back-propagation algorithm for training neural networks has been widely used in neuro-computing [21][22]. Its convergence behavior has also been analyzed in [20][23][24][25][26]. It was first suggested that the momentum coefficient should remain constant, but it has since been observed that it should be changed dynamically throughout the training of the network, hence the nomenclature BPAM (back-propagation with adaptive momentum) [27]. The algorithm is similar in nature to gradient descent but involves an additional hyper-parameter µ that starts with a typical initial value of 0.5 and is then progressively annealed as the number of iterations increases. The mathematical formulation programmed for our work is in accordance with the following equations:

v_t = -\eta \frac{\partial J}{\partial \theta} + \mu_{t-1} v_{t-1}
\Delta\theta_t = \Delta\theta_{t-1} + v_t
\theta_{new} = \theta_{old} + \Delta\theta_t    (2)

where
\eta = learning rate
\theta = weights
J = cost function
\mu = momentum hyperparameter
v = velocity parameter
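A minimal sketch of the momentum update is shown below for illustration. It implements the standard constant-coefficient momentum step (velocity accumulation plus weight update); the adaptive annealing of µ described above (BPAM) and the paper's exact \Delta\theta bookkeeping are not reproduced, and the default values of 0.4 follow the settings reported in the Results section.

```python
import numpy as np

def momentum_step(theta, grad, velocity, eta=0.4, mu=0.4):
    """Constant-coefficient momentum: v_t = mu*v_{t-1} - eta*dJ/dtheta."""
    velocity = mu * velocity - eta * np.asarray(grad)
    return theta + velocity, velocity
```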
technique that helps the stability and convergence of gradient
Nesterov's Momentum
Nesterov's adaptive gradient is a first-order optimization technique that helps the stability and convergence of gradient descent [28]. The algorithm is computed as follows:

v_t = -\eta \frac{\partial J}{\partial \theta} + \mu_{t-1} v_{t-1}
\theta_{new} = \theta_{old} - \mu_{t-1} v_{t-1} + (1 + \mu_t) v_t    (3)

where
\eta = learning rate
\theta = weights
J = cost function
\mu = momentum hyperparameter
v = velocity parameter
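The sketch below illustrates the velocity-correction form of Nesterov momentum used in Eq. (3). It reflects our reading of the update with a constant µ, not the authors' code, and the default hyperparameters again follow the Results section.

```python
import numpy as np

def nesterov_step(theta, grad, velocity, eta=0.4, mu=0.4):
    """Nesterov momentum in the look-ahead correction form of Eq. (3)."""
    v_prev = velocity
    velocity = mu * velocity - eta * np.asarray(grad)        # v_t
    theta = theta - mu * v_prev + (1.0 + mu) * velocity      # correction step
    return theta, velocity
```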


Adaptive Gradient Learning Algorithm
Duchi et al. presented sub-gradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations for more informative back-propagation learning [11]. The algorithm is more suitable for sparse data, and it has been used to train large-scale neural networks, such as the work presented by Pennington et al. for GloVe word embeddings and by Google for detecting images of cats in YouTube videos [29][30]. The idea behind ADAgrad is that it performs larger updates for infrequent parameters and smaller updates for frequent ones. In our work, we aim to test the algorithm on a small database to gauge its performance and accuracy.

\Delta\theta_t = -\frac{\eta}{\sqrt{G_t + \epsilon}} \frac{\partial J}{\partial \theta}
G_t = G_{t-1} + \left( \frac{\partial J}{\partial \theta} \right)^2
\theta_{new} = \theta_{old} + \Delta\theta_t    (4)

where
\eta = learning rate
\theta = weights
J = cost function
t = number of iterations
\epsilon = tuning parameter to prevent division by zero
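For illustration, a minimal numpy sketch of the per-parameter ADAgrad update in Eq. (4) follows. The accumulator name G and the epsilon default are our own choices; the learning rate default follows the Results section.

```python
import numpy as np

def adagrad_step(theta, grad, G, eta=0.4, eps=1e-8):
    """Eq. (4): accumulate squared gradients, scale the step per parameter."""
    grad = np.asarray(grad)
    G = G + grad ** 2                                # G_t = G_{t-1} + (dJ/dtheta)^2
    theta = theta - eta * grad / np.sqrt(G + eps)    # larger steps for rarely-updated weights
    return theta, G
```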
Root Mean Squared Propagation
Geoffrey Hinton, in his Coursera lectures, introduces an unpublished method to train neural networks called RMSprop. RMSprop modifies the ADAgrad method, reducing its aggressive, monotonically decreasing learning rate by employing a moving average of the squared gradients G_{t,ii}:

\Delta\theta_t = -\frac{\eta}{\sqrt{G_t + \epsilon}} \frac{\partial J}{\partial \theta}
G_t = 0.9\, G_{t-1} + 0.1 \left( \frac{\partial J}{\partial \theta} \right)^2
\theta_{new} = \theta_{old} + \Delta\theta_t    (5)

where
\eta = learning rate
\theta = weights
J = cost function
t = number of iterations
\epsilon = tuning parameter to prevent division by zero
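A corresponding sketch of the RMSprop update in Eq. (5) is given below. The 0.9/0.1 weighting follows the equation, while the epsilon default is our own assumption.

```python
import numpy as np

def rmsprop_step(theta, grad, G, eta=0.4, eps=1e-8):
    """Eq. (5): exponential moving average of squared gradients."""
    grad = np.asarray(grad)
    G = 0.9 * G + 0.1 * grad ** 2                    # leaky accumulator
    theta = theta - eta * grad / np.sqrt(G + eps)
    return theta, G
```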
Gauss-Newton Algorithm
The Gauss-Newton algorithm, or GNA, is a modification of Newton's algorithm that avoids the complexity of calculating second derivatives. The methodology involves solving a nonlinear least-squares problem to minimize the sum of squares of our cost function J(\theta), or s(\theta), per the following formulation:

e(\theta) = z - g(\theta) = 0

Expanding e(\theta) by a Taylor series:

e(\theta) = z - \left[ g(\theta)\big|_{\theta_0} + \frac{\partial g(\theta)}{\partial \theta}\Big|_{\theta_0} \Delta\theta + \text{h.o.t.} \right] = 0

Setting y = z - g(\theta)\big|_{\theta_0} and A = \frac{\partial g(\theta)}{\partial \theta}\Big|_{\theta_0}, the Jacobian matrix, gives:

e(\theta) = y - A\,\Delta\theta = 0    (6)

We want to find \Delta\theta so that the sum of squared errors is minimized:

J(\theta) = s(\theta) = e(\theta)^T e(\theta) = (y - A\,\Delta\theta)^T (y - A\,\Delta\theta)

Perturbing s(\theta) and applying the partial derivative with respect to \Delta\theta yields:

s(\theta + \delta\theta) = [\,y - A(\Delta\theta + \delta\theta)\,]^T [\,y - A(\Delta\theta + \delta\theta)\,] = [\,y - A\,\Delta\theta - A\,\delta\theta\,]^T [\,y - A\,\Delta\theta - A\,\delta\theta\,]

\frac{\partial s}{\partial (\delta\theta)} = -2\,[\,y - A\,\Delta\theta - A\,\delta\theta\,]^T A = 0

\Delta\theta = (A^T A)^{-1} A^T (y - A\,\Delta\theta)

Levenberg's Contribution
Levenberg modifies the Gauss-Newton algorithm to include the parameter µ multiplied by an identity matrix in the following manner:

\Delta\theta = (A^T A + \mu I)^{-1} A^T (y - A\,\Delta\theta)    (7)

When our cost function J decreases rapidly, µ is set to a small value and the weight update \Delta\theta approaches (A^T A)^{-1} A^T (y - A\,\Delta\theta), which is the GNA. On the other hand, if the cost function decreases slowly, µ is set to a large value and the weight update \Delta\theta approaches (\mu I)^{-1} A^T (y - A\,\Delta\theta), or the traditional gradient descent direction given by the transposed Jacobian.

Marquardt's Contribution
Marquardt suggests replacing the identity matrix with the diagonal matrix consisting of the diagonal elements of A^T A. The latter mitigates slow convergence in the direction of small gradients, hence the following Levenberg-Marquardt algorithm:

\Delta\theta = \left( A^T A + \mu \,\mathrm{diag}(A^T A) + \epsilon_0 I \right)^{-1} A^T (y - A\,\Delta\theta)    (8)

where the \epsilon_0 I term prevents the singularity condition of A^T A.
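As a sketch of how one Levenberg-Marquardt weight update per Eq. (8) could be coded (our own illustration, not the authors' implementation): A stands for the Jacobian and e for the current residual vector, i.e. y - A\,\Delta\theta in the text. The µ schedule of x0.1 on improvement and x10 otherwise, reported in the Results section, is shown only as a comment.

```python
import numpy as np

def lm_step(theta, A, e, mu, eps0=1e-8):
    """One Levenberg-Marquardt update per Eq. (8).

    A  : Jacobian of the network error with respect to the weights
    e  : current error (residual) vector
    mu : damping parameter; typically decreased (x0.1) when the cost drops
         and increased (x10) when it grows, per the Results section.
    """
    AtA = A.T.dot(A)
    H = AtA + mu * np.diag(np.diag(AtA)) + eps0 * np.eye(AtA.shape[0])
    delta = np.linalg.solve(H, A.T.dot(e))   # solve instead of explicit inversion
    return theta + delta
```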


III. RESULTS
The neural network parameters used in our work include nine input, nine hidden, and one output node. The learning rate is set to 0.4 for all cases, while the momentum parameter is set to 0.4 for both the Nesterov and regular momentum methods. For the Levenberg-Marquardt algorithm, the parameter µ is initialized to 0.001 with a decrease factor of 0.1 and an increase factor of 10. The weight matrices are initialized identically for all cases. Training stops when the mean squared error reaches 0.0001 in all cases. The results are presented visually in figures 2 through 7, while table 1 provides a summary comparing the models in terms of performance and accuracy. Each graphical representation includes two plots: the first subplot shows the mean square error propagation versus the number of iterations, while the second subplot shows the neural network predicted outputs together with the actual outputs of the dataset.
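For reference, the experimental settings listed above can be summarized as follows; this is a plain restatement of the reported values, and the dictionary keys are our own naming.

```python
# Settings reported in this section; key names are illustrative only.
experiment_config = {
    "learning_rate": 0.4,      # eta for all algorithms
    "momentum": 0.4,           # mu for momentum and Nesterov momentum
    "lm_mu_init": 0.001,       # initial damping for Levenberg-Marquardt
    "lm_mu_decrease": 0.1,     # factor applied when the cost decreases
    "lm_mu_increase": 10,      # factor applied when the cost increases
    "mse_stop": 1e-4,          # training stops at this mean squared error
}
```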

Figure 2: Levenberg-Marquardt algorithm implementation
Figure 3: Gradient Descent algorithm implementation
Figure 4: Momentum algorithm implementation


Figure 5: Nesterov Momentum algorithm implementation
Figure 6: ADAgrad algorithm implementation
Figure 7: RMSprop algorithm implementation

Algorithm              Convergence (# of iterations)    Accuracy (%)
Gradient Descent       1844                             100%
Momentum               562                              100%
Nesterov Momentum      401                              100%
ADAgrad                257                              100%
Levenberg-Marquardt    165                              100%
RMSprop                11                               100%

Table 1: Convergence speed (number of iterations) and model performance (percent accuracy) for each algorithm.

IV. CONCLUSION
The work performed in this manuscript suggests that all learning models sustained 100% accuracy. The RMSprop algorithm was the fastest, attaining convergence in 11 iterations compared to 165 iterations for the second-ranked Levenberg-Marquardt algorithm. The RMSprop algorithm is also more efficient, costing less computing power than Levenberg-Marquardt, since the latter must perform a matrix inversion. In addition, the RMSprop algorithm requires fewer hyper-parameters than the LM methodology. Although our dataset is considered small, it provides a good benchmarking tool before moving to large-scale datasets or deep learning models such as convolutional neural networks. The convergence of the RMSprop, ADAgrad, and LM algorithms was more stable as they progressed towards the global minimum compared with the gradient descent, conventional momentum, and Nesterov momentum methodologies.


REFERENCES

[1] D.E. Rumelhart, J.L. McClelland, and the PDP Research Group, Parallel Distributed Processing—Explorations in the Microstructure of Cognition, MIT Press, MA, 1986.
[2] M.R. Meybodi, H. Beigy, "A note on learning automata-based schemes for adaptation of BP parameters," Neurocomputing 48 (2002) 957–974.
[3] L.W. Chan, F. Fallside, "An adaptive training algorithm for backpropagation networks," Computer Speech and Language 2 (1987) 205–218.
[4] G. Qiu, M.R. Varley, T.J. Terrell, "Accelerated training of backpropagation networks by using adaptive momentum step," IEE Electronics Letters 28 (4) (1992) 377–379.
[5] X. Yu, N.K. Loh, W.C. Miller, "A new acceleration technique for the backpropagation algorithm," in: IEEE International Conference on Neural Networks, vol. 3, 1993, pp. 1157–1161.
[6] E. Istook, T. Martinez, "Improved backpropagation learning in neural networks with windowed momentum," International Journal of Neural Systems 12 (3–4) (2002) 303–318.
[7] C. Yu, B. Liu, "A backpropagation algorithm with adaptive learning rate and momentum coefficient," in: Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN'02), vol. 2, 2002, pp. 1218–1223.
[8] H.M. Shao, G.F. Zheng, "A new BP algorithm with adaptive momentum for FNNs training," in: 2009 WRI Global Congress on Intelligent Systems (GCIS'09), vol. 4, 2009, pp. 16–20.
[9] K. Levenberg, "A Method for the Solution of Certain Non-Linear Problems in Least Squares," Quarterly of Applied Mathematics 2 (1944) 164–168.
[10] D. Marquardt, "An Algorithm for Least-Squares Estimation of Nonlinear Parameters," SIAM Journal on Applied Mathematics 11 (2) (1963) 431–441. doi: 10.1137/0111030.
[11] Duchi et al., "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization," Journal of Machine Learning Research 12, 2121–2159.
[12] X. Pan, B. Lee, C. Zhang, "A comparison of neural network backpropagation algorithms for electricity load forecasting," 2013 IEEE International Workshop on Intelligent Energy Systems, 2013, pp. 22–27.
[13] Z. Nouir, B. Sayrac, B. Fourestie, W. Tabbara, F. Brouaye, "Comparison of Neural Network Learning Algorithms for Prediction Enhancement of a Planning Tool," 13th European Wireless Conference, 2007.
[14] O. Kisi, E. Uncuoglu, "Comparison of the three backpropagation training algorithms for two case studies," Indian J Eng Mater Sci 12 (5) (2005) 434–442.
[15] C.B. Khadse, M.A. Chaudhari, V.B. Borghate, "Comparison between three back-propagation algorithms for power quality monitoring," 2015 Annual IEEE India Conference (INDICON), New Delhi, 2015, pp. 1–5. doi: 10.1109/INDICON.2015.7443766.
[16] I. Arora, A. Saha, "Comparison of back propagation training algorithms for software defect prediction," 2016 2nd International Conference on Contemporary Computing and Informatics (IC3I), Noida, 2016, pp. 51–58. doi: 10.1109/IC3I.2016.7917934.
[17] V.K. Garg, R.K. Bansal, "Comparison of neural network back propagation algorithms for early detection of sleep disorders," 2015 International Conference on Advances in Computer Engineering and Applications, Ghaziabad, 2015, pp. 71–75. doi: 10.1109/ICACEA.2015.7164648.
[18] T. Bazzi, M. Estes, B. Scherer, "Artificial Intelligence For Precipitator Diagnostics," Power Plant Pollutant Control MEGA Symposium (MEGA 2016), Air and Waste Management Association (A&WMA), Vol. 1, 2016, ISBN 9781510829862.
[19] K. Hornik, "Approximation capabilities of multilayer feedforward networks," Neural Networks 4 (2) (1991) 251–257.
[20] C. Park, M. Ki, J. Namkung, J. Paik, "Multimodal Priority Verification of Face and Speech Using Momentum Back-Propagation Neural Network," in: J. Wang, Z. Yi, J.M. Zurada, B.L. Lu, H. Yin (eds.), Advances in Neural Networks - ISNN 2006, Lecture Notes in Computer Science, vol. 3972, Springer, Berlin, Heidelberg, 2006.
[21] D.E. Rumelhart, J.L. McClelland, and the PDP Research Group, Parallel Distributed Processing—Explorations in the Microstructure of Cognition, MIT Press, MA, 1986.
[22] M.R. Meybodi, H. Beigy, "A note on learning automata-based schemes for adaptation of BP parameters," Neurocomputing 48 (2002) 957–974.
[23] M. Gori, M. Maggini, "Optimal convergence of on-line backpropagation," IEEE Transactions on Neural Networks (1996) 251–254.
[24] V.V. Phansalkar, P.S. Sastry, "Analysis of the back-propagation algorithm with momentum," IEEE Transactions on Neural Networks 5 (3) (1994) 505–506.
[25] M. Torii, M.T. Hagan, "Stability of steepest descent with momentum for quadratic functions," IEEE Transactions on Neural Networks 13 (3) (2002) 752–756.
[26] Z. Zeng, "Analysis of global convergence and learning parameters of the back-propagation algorithm for quadratic functions," Lecture Notes in Computer Science, vol. 4682, 2007, pp. 7–13.
[27] E. Istook, T. Martinez, "Improved backpropagation learning in neural networks with windowed momentum," International Journal of Neural Systems 12 (3–4) (2002) 303–318.
[28] Y. Bengio, N. Boulanger-Lewandowski, R. Pascanu, "Advances in optimizing recurrent networks," 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, 2013, pp. 8624–8628. doi: 10.1109/ICASSP.2013.6639349.
[29] J. Pennington, R. Socher, C.D. Manning, "GloVe: Global Vectors for Word Representation," Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014, pp. 1532–1543.
[30] L. Clark, "Google's Artificial Brain Learns to Find Cat Videos," Wired, June 26, 2012. Retrieved from https://www.wired.com/2012/06/google-x-neural-network/
