A Survey of Randomized Algorithms For Training Neural Networks
Article history: Received 10 April 2015; Revised 12 November 2015; Accepted 17 January 2016; Available online 23 January 2016

Keywords: Randomized neural networks; Recurrent neural networks; Convolutional neural networks; Deep learning

Abstract: As a powerful tool for data regression and classification, neural networks have received considerable attention from researchers in fields such as machine learning, statistics, computer vision and so on. There exists a large body of research work on network training, most of which tunes the parameters iteratively. Such methods often suffer from local minima and slow convergence. It has been shown that randomization based training methods can significantly boost the performance or efficiency of neural networks. Among these methods, most approaches use randomization either to change the data distributions and/or to fix a part of the parameters or network configurations. This article presents a comprehensive survey of the earliest work and recent advances, as well as some suggestions for future research.

© 2016 Elsevier Inc. All rights reserved.
1. Introduction
Inspired by biological neural networks, the artificial neural network (ANN) is a family of non-parametric learning methods for estimating or approximating functions that may depend on a large number of inputs and outputs. Typically, the training protocol of an ANN is based on minimizing a loss function defined on the desired output of the data and the actual output of the ANN by updating the parameters. Classical approaches usually tune the parameters based on the derivatives of the loss function. However, much of the power of an ANN comes from the nonlinear functions in the hidden units, which model the nonlinear mapping between the input and the output. Unfortunately, this kind of architecture loses the elegance of finding the globally minimal solution with respect to all the parameters of the network, since the loss function depends on the outputs of nonlinear neurons. Thus, the optimization becomes a nonlinear least squares problem which is usually solved iteratively. In this case, the error has to be back-propagated through the network to serve as guidance for tuning the parameters [30]. Consequently, it is widely acknowledged that these training methods are very slow [38], may not converge to the global minimum because there exist many local minima [29,53], and may produce networks that are weak in real-world noisy situations. These weaknesses naturally limit the applicability of gradient-based algorithms for training neural networks. Randomization based methods remedy this problem by either randomly fixing the network configuration (such as the connections) or some parts of the network parameters (while optimizing the rest by a closed-form solution or an iterative procedure), or by randomly corrupting the input data or the parameters during training. Remarkable results have been achieved in various network structures, such as the single hidden layer feed-forward network [69],
RBF neural networks [9], deep neural networks with multiple hidden layers [31], convolutional neural networks [43] and so on.
A main goal of this paper is to show the role and place of randomized methods in optimization based learning of neural networks. In Section 2, we present early work in this line of research on the perceptron and on standard feed-forward neural networks with random parameters in the hidden neurons. Another piece of important work is the Random Vector Functional Link network, which is described in Section 3. Randomization based learning in RBF networks, recurrent neural networks and deep neural networks is presented in Sections 4, 5, and 6, respectively. We also offer some details on other scenarios, such as evolutionary learning, in Section 7. In Section 8, we point out some research gaps in the literature on randomization algorithms for neural network training. Conclusions are presented in the last section.
2. Early works on the perceptron and standard feed-forward neural networks with randomization
The earliest attempt in this research area was the "perceptron" presented in [65] and extended in [10,66]. Generally speaking, a perceptron consists of a retina of sensor units, associator units and response units. The sensor units are connected to the associator units in a "random and many-to-many" manner. The associator units may connect to other associator units and/or response units. When a stimulus (or the input data) is presented to the sensor units, impulses are conducted from the activated sensor units to the associator units. An associator unit is activated once the total arriving signal exceeds a threshold. In this case, an impulse from the associator unit is sent to the units connected to it. In the perceptron, the weights between the sensor units and the associator units can be regarded as randomly selected from {1, 0}, while the weights between the associator units and the response units are obtained by reinforcement learning.
In [69], the authors investigated the performance of a single hidden layer feed-forward neural network (SLFN), which is shown in Fig. 1. In that work, the weights between the input layer and the hidden layer are randomly generated and kept fixed. The authors reported that the weights between the hidden layer and the output layer are of greater importance, and the rest may not need to be tuned once they are properly initialized.
For a given classification problem with limited training data, there are numerous solutions with different parameter settings that are statistically acceptable. In this case, training becomes much easier because the learning set is only needed to make a rough selection in the parameter space. Setting the parameters in the hidden neurons randomly helps to remove the redundancy of the solution in the parameter space and thus makes the solution less sensitive to the resulting parameters compared with other typical learning rules such as back-propagation. In [69], the weights in the hidden neurons are set to uniform random values in [−1, +1], and the authors suggest optimizing this range for the specific application. An alternative choice is to set the hidden neurons to act as "correlators", which means fixing the weights in the hidden neurons with a random subset of the training data.
In [69], the network’s output layer weights are optimized by minimizing the following squared error:
$$\varepsilon^2 = \sum_{i=1}^{N}\left(y_i - \sum_{j=0}^{k} w_j f_{ij}\right)^2 \qquad (1)$$
where N is the number of data samples and k is the number of hidden neurons, $f_{ij}$ is the activation value of the jth hidden neuron on the ith data sample ($f_{i0}$ is the bias), and $y_i$ is the target of the ith data sample.
Fig. 1. The structure of the SLFN in [69]. x denotes the input features. The arrows within the yellow rectangle represent the random weights ($w_{hidden}$) which connect the input features to the hidden neurons. The arrows within the green rectangle are the output weights ($w_{output}$) which need to be optimized. The $x_0$ and $f_0$ nodes can be regarded as the bias terms in the input and hidden layers. y is the desired output target. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Denote by
$$F_i = [f_{i0}, f_{i1}, \ldots, f_{ik}]^T \qquad (2)$$
the concatenated vector of activation values of the hidden layer for the ith data sample together with the bias. The optimal output layer weight vector, $W_{output}$, can then be derived as
$$W_{output} = R^{-1}P, \qquad R = \sum_{i=1}^{N} F_i F_i^{T}, \qquad P = \sum_{i=1}^{N} y_i F_i^{T} \qquad (3)$$
The above optimization problem can also be regarded as a linear regression which can be solved in a single step.
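To make the procedure concrete, the following NumPy sketch implements this kind of network under assumptions not fixed above: hidden weights drawn uniformly from [−1, 1] and never tuned, a tanh activation in the hidden neurons, and the output weights obtained in a single step by least squares as in Eq. (3).

```python
import numpy as np

def train_random_slfn(X, y, k=100, rng=np.random.default_rng(0)):
    """Random-hidden-weight SLFN sketch: only the output weights are learned."""
    n, d = X.shape
    W_hidden = rng.uniform(-1.0, 1.0, size=(d + 1, k))    # random weights, kept fixed
    X1 = np.hstack([np.ones((n, 1)), X])                  # x_0 = 1 bias input
    F = np.hstack([np.ones((n, 1)),                       # f_0 = 1 bias activation
                   np.tanh(X1 @ W_hidden)])               # hidden activations f_ij
    W_output, *_ = np.linalg.lstsq(F, y, rcond=None)      # single-step least squares
    return W_hidden, W_output

def predict_random_slfn(X, W_hidden, W_output):
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    F = np.hstack([np.ones((X.shape[0], 1)), np.tanh(X1 @ W_hidden)])
    return F @ W_output
```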
In [1], the general dynamics of complex systems are realized by a feed-forward neural network with random weights. The outputs of the network are fed back to the inputs to generate a time series. By investigating the percentage of systems that exhibit chaos, the distribution of the largest Lyapunov exponents, and the distribution of correlation dimensions, the authors show that the probability of chaos approaches unity and that the correlation dimension is typically much smaller than the system dimension as the system becomes more complex due to increasing numbers of inputs and neurons.
In [23], the "Jacobian Neural Network" (JNN) is proposed, a polynomial-time randomized algorithm that yields an optimal network with probability 1. In JNN, the number of hidden neurons can also be learned. The authors consider a linear combination of two networks, where one of them is randomly generated and the other can be obtained analytically from the first random neural network.
3. Random vector functional link networks

Another piece of pioneering work on training neural networks with randomization can be found in [56]. The so-called Random Vector Functional Link (RVFL) neural network model can be regarded as a semi-random realization of the functional link neural network, whose basic architecture is shown in Fig. 2.
The rationale behind this architecture is to improve the generalization ability of the network with enhanced features, which are obtained by applying a transformation followed by a nonlinearity to the original features. The weights $a_{ij}$ from the input to the enhancement nodes are randomly generated such that the activation function $g(a_j^T x + b_j)$ is not saturated most of the time. For the RVFL, only the output weights $\beta_j$ need to be optimized. Suppose the input data has k features and there are J enhancement neurons; then there are in total k + J inputs for each output node. Learning is achieved by minimizing
the following expression:
$$E = \frac{1}{2N}\sum_{i=1}^{N}\left(t^{i} - B^{T} d^{i}\right)^2 \qquad (4)$$
where B consists of the weight values $\beta_j$, $j = 1, 2, \ldots, k + J$, $d^{i}$ is the vector of the k + J original and enhancement features of the ith sample, $t^{i}$ is its target, and N is the total number of input data. E is quadratic with respect to the (k + J)-dimensional weight vector B, indicating that the unique minimum can be found in no more than k + J iterations of a learning procedure such as the conjugate gradient method. For simplicity, let $o_p = \sum_j \beta_j x_{pj}$ be the output for the pth data sample, where $x_{pj}$ denotes the jth of the k + J inputs of that sample; then the changes in the weights are set to $\Delta\beta_{pj} = \eta (t_p - o_p) x_{pj}$. In the (k + 1)th iteration, the weights are updated as $\beta_j(k+1) = \beta_j(k) + \sum_p \Delta\beta_{pj}$. The learning procedure iterates until a stopping criterion is met. An alternative solution within a single learning step can be achieved by the Moore–Penrose pseudo-inverse [37,55]. In this case, the weight vector B can be obtained as $B = t d^{+}$, where $d^{+}$ represents the Moore–Penrose pseudo-inverse of the input matrix d.
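A minimal NumPy sketch of the single-step RVFL solution is given below. The tanh enhancement nonlinearity, the sampling range of the random parameters and the use of the pseudo-inverse on the combined (original plus enhanced) feature matrix are illustrative assumptions.

```python
import numpy as np

def train_rvfl(X, T, J=50, scale=1.0, rng=np.random.default_rng(0)):
    """RVFL sketch: random enhancement nodes plus direct input-output links;
    only the output weights B are computed, via the Moore-Penrose pseudo-inverse."""
    n, k = X.shape
    A = rng.uniform(-scale, scale, size=(k, J))    # random weights a_ij, kept fixed
    b = rng.uniform(-scale, scale, size=J)         # random biases b_j
    H = np.tanh(X @ A + b)                         # enhancement features g(a_j^T x + b_j)
    D = np.hstack([X, H])                          # k + J inputs per output node
    B = np.linalg.pinv(D) @ T                      # single-step pseudo-inverse solution
    return A, b, B

def predict_rvfl(X, A, b, B):
    return np.hstack([X, np.tanh(X @ A + b)]) @ B
```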
In [37], some theoretical justifications can be found for RVFL as well as other neural networks with hidden nodes
implemented as products of univariate functions or radial basis functions. They formulate the problem as a limit-integral representation of the function to be approximated, and the limit integral is then approximated using the Monte Carlo method. It has been proven that, with the weights and biases from the input layer to the hidden layer sampled from a uniform distribution within a proper range, the RVFL is an efficient universal approximator for continuous functions on bounded finite dimensional sets. Indeed, the overall approximation error can be bounded by the sum of the error of approximating the function by the integral and the error of approximating the integral by the Monte Carlo method.

Fig. 2. The structure of the RVFL. The input features are first transformed into enhanced features by the enhancement nodes, whose parameters are randomly generated. All the original and enhanced features are connected to the output neurons.
A comprehensive evaluation of RVFL was conducted in [78]. The authors show that the direct links from the input to the output layer play a key role in the classification ability of RVFL. Moreover, it is advantageous to tune the range of the distribution of the randomized parameters. In the context of time series forecasting, the work in [63] shows that the direct links in RVFL improve the performance in a statistically significant manner. The direct links can be compared to the time-delay line in a finite impulse response (FIR) filter. Moreover, the authors in [63,78] show that the randomization range can be tuned to improve performance. An application of RVFL to optimal control can be found in [55]. In [15], the author shows that the maximum number of hidden nodes needed is N − r − 1 for an RVFL with a constant bias to learn a mapping within a given precision on an n-dimensional N-pattern data set, where r is the rank of the data set. An online learning method is also proposed. Furthermore, a robust weighted least squares method is investigated in order to eliminate outliers. In [16], a dynamic stepwise updating algorithm was proposed for the case when a new enhancement node or a new data sample is added, based on the same pseudo-inverse solution. In [16], the authors also proposed several methods to refine the model to deal with the small singular value problem. It is widely accepted that small singular values may be caused by noise in the data or by round-off errors during computations. Small singular values will result in very large weights which will further amplify the noise in the test data. Some potential solutions in [16] include: (1) investigating an upper bound on the weights; (2) cutting off the small singular values and investigating the relation between the cutoff values and the prediction error of the network; (3) using an orthogonal least squares learning method, a regularization method or cross-validation. These ideas are further explained in [44].
In [36], the authors report that, compared with the RVFL, the multilayer perceptron (MLP) gives a closer approximation to a given function f. Moreover, the functional dependence of the approximation error on the complexity of the model is in both cases of order $1/\sqrt{N}$, where N is the number of hidden neurons. Thus, both the RVFL and the MLP are efficient approximators which avoid an exponential increase with the input dimension n. In the same work, by combining the RVFL with the expectation maximization (EM) method, the authors proposed the GM-RVFL method, with which improvements can be observed. In [2], the authors train a pool of decorrelated RVFLs within the negative correlation learning framework to obtain an ensemble classifier. In [46], two algorithms for training RVFL are proposed in a setting where the training data is distributed throughout a network of agents.
4. Radial basis function networks with random centers

Another neural network architecture is the radial basis function (RBF) network, which was brought to the attention of the neural network community by [12,52,59] and became popular because of its efficiency and ease of training. The RBF network shares a similar architecture with the standard multilayer feedforward neural network. The difference lies in the input of a hidden neuron of an RBF network, which is the distance between the input pattern and the "center" of the basis function, instead of the weighted sum of the inputs as in a standard ANN. In [12], the authors demonstrate that the RBF network is sufficient to represent an arbitrary nonlinear transformation determined by a finite training data set of input–output patterns, where the centers of the RBFs can be selected from the training data or randomly generated, although randomly generated centers may not reflect the data distribution and can thereby lead to poorer performance. To be more specific, suppose the training data is $(x_i, y_i)$, $i = 1, 2, 3, \ldots, N$; the activation of the jth hidden neuron for the tth sample is $s_j(x_t) = \phi(\|x_t - c_j\|)$, where $c_j$ is the center of this RBF. Thus, the problem can be formulated by the following linear equations:
$$Y = \Phi\beta$$
where $Y = [y_1, y_2, \ldots, y_N]^{T}$,
$$\Phi = \begin{bmatrix} \phi_{11} & \cdots & \phi_{1N} \\ \vdots & \ddots & \vdots \\ \phi_{N1} & \cdots & \phi_{NN} \end{bmatrix} \qquad (5)$$
and $\phi_{ij} = \phi(\|x_i - c_j\|)$.
Given the existence of the inverse of the matrix $\Phi$, the weights $\beta$ which connect the RBFs to the output node can be obtained by $\beta = \Phi^{-1}Y$. Micchelli proved [58] that for all N and for a large class of functions $\phi$, the matrix $\Phi$ is non-singular if the data points are all distinct.
The above equations offer an efficient solution for the case where the number of RBFs is equal to the number of distinct data samples. In practice, one may be interested in obtaining the same performance with minimum model complexity, that is, with a minimum number of RBFs. In this case, the matrix $\Phi$ may no longer be square. With the "minimum number of RBFs" constraint, the number of RBFs is smaller than the number of training samples and the system of equations is over-determined. Thus, a unique inverse no longer exists, and a minimum-norm least squares solution is adopted by employing the Moore–Penrose pseudo-inverse $\Phi^{+}$ of $\Phi$.
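The following sketch illustrates an RBF network whose centers are a random subset of the training data; the Gaussian basis function and its width are assumptions, and the output weights come from the minimum-norm least squares solution described above.

```python
import numpy as np

def train_random_rbf(X, Y, m=50, sigma=1.0, rng=np.random.default_rng(0)):
    """RBF network sketch with m randomly selected centers (m < N)."""
    centers = X[rng.choice(len(X), size=m, replace=False)]     # random data points as centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # ||x_i - c_j||^2
    Phi = np.exp(-d2 / (2.0 * sigma ** 2))                     # phi_ij = phi(||x_i - c_j||)
    beta = np.linalg.pinv(Phi) @ Y                             # beta = Phi^+ Y
    return centers, beta

def predict_random_rbf(X, centers, beta, sigma=1.0):
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2)) @ beta
```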
5. Recurrent neural networks with random weights

The recurrent neural network (RNN) is another type of neural network. Different from feedforward neural networks, where activations are piped through the network from the input neurons to the output neurons, an RNN has at least one cyclic path of synaptic connections. One typical structure of an RNN can be found in Fig. 3.

A random RNN is a set of N fully connected neurons. The weights that connect the neurons are randomly initialized. The states of the neurons are $X(t) = \{x_i(t)\}$, $i = 1, \ldots, N$, where each $x_i$ is proportional to the firing frequency of the ith neuron.
The state dynamics can be modeled by the following discrete-time equations:
$$\forall t > 0,\ \forall i = 1, \ldots, N, \quad x_i(t+1) = f\left(\sum_{j=1}^{N} w_{ij} x_j(t) + I_i - \theta_i\right) \qquad (6)$$
where f is the activation function, I is an N-dimensional constant input vector which is randomly generated, and $\theta_i$ is the threshold of the ith neuron. Time-varying extensions of I can be found in [9]. There exist many random RNNs in the literature. Among them, the Liquid State Machine [50] randomly fixes the input weights and the internal weights, while the output weights are learned by the well-known recursive least squares (RLS) algorithm. The liquid state machine uses spiking activation functions. The Echo State Network [39] works in a fairly similar manner to the liquid state machine, but uses continuous activation functions in its neurons. Readers are referred to [3,72] for various extensions along this line of research.
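A rough echo-state-style sketch follows: the input and recurrent weights are random and fixed, the recurrent matrix is rescaled to a spectral radius below one, and only the linear readout is trained (here by ridge regression rather than RLS; the reservoir size, weight ranges and regularization constant are assumptions).

```python
import numpy as np

def train_random_rnn_readout(U, Y, n_res=200, rho=0.9, ridge=1e-6,
                             rng=np.random.default_rng(0)):
    """Random RNN / echo-state sketch: only the output weights are learned."""
    T, d = U.shape
    W_in = rng.uniform(-0.5, 0.5, size=(n_res, d))        # random input weights
    W = rng.uniform(-0.5, 0.5, size=(n_res, n_res))       # random internal weights
    W *= rho / np.max(np.abs(np.linalg.eigvals(W)))       # scale spectral radius to rho
    X = np.zeros((T, n_res))
    x = np.zeros(n_res)
    for t in range(T):                                    # Eq. (6)-style state update
        x = np.tanh(W_in @ U[t] + W @ x)
        X[t] = x
    # ridge-regression readout: W_out = (X^T X + ridge * I)^{-1} X^T Y
    W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ Y)
    return W_in, W, W_out
```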
6. Randomization in deep neural networks

Deep structures have become a hot research topic since the work in [31]. In deep learning, deep structures with multiple layers of non-linear operations are able to learn high-level abstractions. Such high-level abstraction is the key factor behind the success of many state-of-the-art systems in vision, language, and other AI-level tasks. Complex training algorithms combined with carefully chosen parameters (e.g., learning rate, mini-batch size, number of epochs) can lead to a deep neural network (DNN) with high performance.
The autoencoder [7,31], which is a basic building block of recent deep neural networks, can be decomposed into two parts: an encoder and a decoder. The encoder is a deterministic mapping $f_{\theta}$ that maps the input x to a hidden representation y through $f_{\theta}(x) = s(Wx + b)$, where $\theta = \{W, b\}$ and s is a non-linear activation function such as the sigmoid. In the decoder, the hidden representation is then mapped back to a reconstruction z of the input x. This mapping is achieved by $g_{\theta'}(y) = s(W'y + b')$ with $\theta' = \{W', b'\}$. Fig. 4 shows
the structure of a denoising autoencoder. In a denoising autoencoder, the input is first corrupted by some random noise, and the autoencoder aims to reconstruct the "clean" input. In [73], the authors proposed stacked denoising autoencoders, which are trained locally to denoise corrupted versions of their inputs. On a benchmark of classification problems they were shown to yield significantly lower classification error than stacked autoencoders. These works clearly establish the value of using a randomization based denoising criterion as a tractable unsupervised objective to guide the learning in deep neural networks.

Fig. 4. Basic structure of a denoising autoencoder: the input x is randomly corrupted to x̃, encoded to a hidden representation y, and decoded to a reconstruction z.
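As an illustration of the denoising criterion, the sketch below trains a small tied-weight denoising autoencoder with stochastic gradient descent on the squared reconstruction error; the masking noise, sigmoid activations, tied weights and learning rate are assumptions, and the inputs are assumed to be scaled to [0, 1].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_denoising_autoencoder(X, n_hidden=64, p_corrupt=0.3, lr=0.1,
                                epochs=20, rng=np.random.default_rng(0)):
    """Tied-weight denoising autoencoder sketch: corrupt x randomly, reconstruct the clean x."""
    n, d = X.shape
    W = rng.normal(0.0, 0.01, size=(d, n_hidden))
    b = np.zeros(n_hidden)                               # encoder bias
    c = np.zeros(d)                                      # decoder bias
    for _ in range(epochs):
        for i in rng.permutation(n):
            x = X[i]
            x_tilde = x * (rng.random(d) > p_corrupt)    # random masking corruption
            y = sigmoid(x_tilde @ W + b)                 # encoder f_theta
            z = sigmoid(y @ W.T + c)                     # decoder (tied weights)
            dz = (z - x) * z * (1 - z)                   # grad of 0.5*||z - x||^2 w.r.t. decoder pre-activation
            dy = (dz @ W) * y * (1 - y)                  # backprop to encoder pre-activation
            W -= lr * (np.outer(x_tilde, dy) + np.outer(dz, y))
            b -= lr * dy
            c -= lr * dz
    return W, b, c
```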
The convolutional neural network (CNN) [43] is another neural network model which has been successfully applied to many tasks such as digit, object and speech recognition. The CNN combines three architectural ideas to ensure some degree of shift, scale and distortion invariance: shared weights, sub-sampling and local receptive fields. A classic CNN, LeNet-5, is shown in Fig. 5.

A CNN is composed of alternating convolutional and pooling layers. The ideas of "shared weights" and "local receptive fields" are embodied in the convolutional layers. A filter of a pre-defined size (e.g. 5 × 5 or 7 × 7) is convolved with the input to obtain a feature map. In order to boost the performance of a CNN, a common approach is to let the network learn an "over-complete" set of feature maps. This architecture can easily be stacked into deep architectures by setting the output of one pooling layer to be the input of the next convolutional layer. In this case, the inputs of each filter in the higher layer can be randomly connected to the outputs of the lower pooling layer. This kind of randomization based connection is well studied in [25,43].
In [40], the authors show that the exact values of the filters in a CNN are less important than the architecture. They report that a two-stage system with random filters can yield satisfactory results, provided that the proper non-linearities and pooling layers are used. This surprising finding has also been verified in [57], where thousands of convolutional pooling architectures are evaluated on a number of object recognition tasks and the random weights are found to be only slightly worse than pretrained weights. The work in [68] further addresses this phenomenon and finds that certain convolutional pooling architectures can be inherently frequency selective and translation invariant even with random weights. Based on this, a practical method is proposed for fast model selection. The authors show that the performance of single-layer convolutional square-pooling networks with random weights is significantly correlated with the performance of such architectures after pretraining and fine-tuning. Thus, random weights can be used for fast architecture search, which improves the performance of state-of-the-art systems. Randomization can also be employed in the pooling layers of a CNN to improve the performance significantly. In [19], a method for selecting local receptive fields is proposed: some features are first randomly selected, and then the local receptive fields are chosen such that the features within each field are most similar to each other according to a pairwise similarity metric. In [77], the authors proposed to randomly pick the activation within each pooling region according to a multinomial distribution given by the activations within the pooling region. Such randomization based pooling, instead of conventional deterministic pooling operations, achieves state-of-the-art performance on several benchmark datasets.
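The stochastic pooling idea of [77] can be sketched as follows for a single-channel feature map (a square pooling region and non-negative activations, e.g. after a ReLU, are assumed): within each region one activation is sampled with probability proportional to its value.

```python
import numpy as np

def stochastic_pool(fmap, size=2, rng=np.random.default_rng(0)):
    """Stochastic pooling sketch: sample one activation per region from a
    multinomial distribution defined by the (non-negative) activations."""
    H, W = fmap.shape
    out = np.zeros((H // size, W // size))
    for i in range(0, (H // size) * size, size):
        for j in range(0, (W // size) * size, size):
            region = fmap[i:i + size, j:j + size].ravel()
            s = region.sum()
            p = region / s if s > 0 else np.full(region.size, 1.0 / region.size)
            out[i // size, j // size] = rng.choice(region, p=p)   # multinomial draw
    return out
```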
As regressors, neural networks model the conditional distribution of the output variables Y given the input variables X. A multi-modal conditional distribution is modeled in [71] by using stochastic hidden variables rather than deterministic
ones. A new Generalized EM training procedure using importance sampling is proposed to train a stochastic feedforward
network with hidden layers composed of both deterministic and stochastic variables to efficiently learn complicated condi-
tional distributions. They achieve superior performance on synthetic and facial expression datasets compared to conditional restricted Boltzmann machines and mixture density networks.
In [8], the authors report empirically and theoretically that randomly chosen trials are more efficient for hyper-parameter
optimization than trials on a grid for deep neural networks. Compared with neural networks configured by a pure grid
search, random search over the same domain is able to find models that are as good or better, within a small
fraction of the computation time. With the same computational budget, random search may find better models by effectively
searching a larger as well as less promising configuration space. Gaussian process analysis of the function from hyper-
parameters to validation set performance is employed. It reveals that for most data sets only a few of the hyper-parameters
really matter. However, different hyper-parameters may be important on different data sets, which makes the commonly
adopted grid search approach a poor choice for configuring algorithms for new data sets. In [47], the authors proposed to
multiply error signals by random synaptic weights in back-propagation for deep neural networks. They demonstrated that
this new mechanism performs as quickly and accurately as back-propagation on a variety of problems.
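The random search strategy of [8] reduces to a very small loop; the hyper-parameter names, their sampling ranges and the train_and_score callable below are hypothetical placeholders.

```python
import numpy as np

def random_search(train_and_score, n_trials=20, rng=np.random.default_rng(0)):
    """Random hyper-parameter search sketch: sample configurations independently
    and keep the one with the best validation score."""
    best_score, best_config = -np.inf, None
    for _ in range(n_trials):
        config = {
            "learning_rate": 10 ** rng.uniform(-5, -1),        # log-uniform sample
            "batch_size": int(rng.choice([32, 64, 128, 256])),
            "n_hidden": int(rng.integers(64, 1025)),
        }
        score = train_and_score(config)                        # user-supplied evaluation
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score
```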
It is well known that a large feed-forward neural network trained on a small training set will typically perform poorly on held-out test data. To tackle this problem, the authors in [70] propose a method called "dropout" to prevent the co-adaptation of feature detectors. The key idea is to randomly drop units (along with their connections) with a pre-defined probability from the neural network during training. This prevents the units from co-adapting too much. Dropping units creates thinned networks during training. Dropout can be seen as an extreme form of bagging in which each model is trained on a single case and each parameter of the model is very strongly regularized by sharing it with the corresponding parameter in all the other models. During testing, all possible thinned networks are combined using an approximate model averaging procedure. The idea of dropout is not limited to feed-forward neural networks; it can be applied more generally to graphical models such as Boltzmann machines. Random dropout yielded big improvements on many benchmark tasks and set new record accuracies in speech and object recognition tasks [70].
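A minimal sketch of the training-time operation is shown below, using the common "inverted dropout" formulation (rescaling the surviving units at training time rather than rescaling the weights at test time, which is a variant of the procedure in [70]).

```python
import numpy as np

def dropout(h, p_drop=0.5, training=True, rng=np.random.default_rng(0)):
    """Inverted-dropout sketch: zero each unit with probability p_drop during
    training and rescale the survivors; at test time the layer is unchanged."""
    if not training:
        return h
    mask = rng.random(h.shape) >= p_drop          # keep each unit with prob. 1 - p_drop
    return h * mask / (1.0 - p_drop)
```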
In [74], the DropConnect network was proposed for regularizing large fully-connected layers within neural networks; it can be regarded as a generalization of dropout. DropConnect defines a larger ensemble of deep networks than dropout. When training with dropout, a randomly selected subset of activations is set to zero within each layer. DropConnect instead sets a randomly selected subset of weights within the network to zero. At test time, the DropConnect network uses sampling-based inference, which was shown to be better than the mean inference used by dropout. The authors also give some theoretical insights on why DropConnect regularizes the network [74]. They proved that the Rademacher complexity [6] of the DropConnect model is lower than that of the standard model.
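For comparison, a DropConnect-style layer masks individual weights rather than activations during training; the ReLU activation and single-mask-per-pass simplification below are assumptions (the sampling-based test-time inference of [74] is omitted).

```python
import numpy as np

def dropconnect_layer(x, W, b, p_drop=0.5, rng=np.random.default_rng(0)):
    """DropConnect training-time sketch: zero a random subset of the weights."""
    M = rng.random(W.shape) >= p_drop             # Bernoulli mask on the weight matrix
    return np.maximum(0.0, x @ (W * M) + b)       # masked affine map followed by ReLU
```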
The multilayer bootstrap network (MBN) [79] is another deep neural network model in which each layer of the network is a group of mutually independent k-centroids clusterings whose centers are randomly sampled data points. In [79], the author shows the relationship between MBN and products of experts, contrastive divergence learning and sparse coding.
7. Other randomization based approaches

Randomization based methods can also be employed in other network configurations. In [27], a special kind of random
neural network is proposed. In other neural networks, the activation of a neuron is either a binary or a continuous variable. In the random neural network, each neuron is represented by its potential (a non-negative integer), and a neuron is considered to be in its firing state if its potential is positive. Readers are referred to [28] for a comprehensive review of training this family of randomized feed-forward neural networks.
Kernel machines such as the support vector machine are attractive because of their excellent generalization ability. However, one drawback of such methods is that the kernel matrix (Gram matrix) scales poorly with the size of the training data. In [61], the authors proposed to map the input data to a randomized low-dimensional feature space in which the inner products uniformly approximate many popular kernels. The proposed random features are demonstrated to be powerful and economical for large scale learning. Another work [20] scaled up kernel methods using a novel concept called the "doubly stochastic functional gradient". It is well known that many kernel methods can be solved by convex optimization algorithms. Dai et al. [20] solved the optimization problems by making two unbiased stochastic approximations to the functional gradient: one uses random training points and the other uses random features associated with the kernel. A fast convergence rate and tight bounds were reported for their method.
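For the Gaussian (RBF) kernel, the random feature map of [61] can be sketched in a few lines; the kernel choice and bandwidth parameter gamma are assumptions.

```python
import numpy as np

def random_fourier_features(X, D=500, gamma=1.0, rng=np.random.default_rng(0)):
    """Random Fourier features: z(x)^T z(x') approximates exp(-gamma * ||x - x'||^2),
    so a linear model on z(X) approximates the kernel machine at much lower cost."""
    d = X.shape[1]
    W = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(d, D))   # frequencies from the kernel's Fourier transform
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)                # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)
```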
In [62], motivated by the fact that randomization is computationally cheaper than optimization based learning, the authors proposed to replace minimization with randomization in a classifier similar to kernel machines, which computes a weighted sum of its inputs after passing them through a pool of arbitrary randomized nonlinearities. The error bound can be roughly decomposed into two parts. The first part indicates that the lowest true risk attainable by a function in the family of randomized classifiers is close to the lowest true risk attainable in a much richer function class. The second part shows that the true risk of every function in the family of randomized classifiers is close to its empirical risk.
In [5], the authors showed that conventional back-propagation does not work on neural networks whose neurons have hard-limiting input-output characteristics, because the derivatives of the error function are not available. However, they showed that if the network weights are random variables with smooth distribution functions, the probability of a hard-limiting unit taking one of its two possible values is a continuously differentiable function.
There have been many attempts to train neural networks by using evolutionary algorithms. An evolutionary algorithm is a randomization based method inspired by mechanisms of biological evolution such as reproduction, mutation, recombination and selection. Population-based algorithms, such as genetic algorithms [33,34], particle swarm optimization [41,51], immune approaches [21,75] and multi-objective optimization [18,76], are known to improve the training of neural networks. Other approaches such as simulated annealing and the Metropolis algorithm have also been successfully applied to neural network training [11,24].
8. Future directions
Though a large body of literature has been published to exploit randomization for training neural networks, there
are still several research gaps in this field. One can investigate the effects of a specific randomization scheme (sparse or non-sparse, uniform or Gaussian, etc.) on different neural network classes by using compressive sensing theory [4,13,14,45]. Another major gap is the lack of extensive experimental comparisons between different randomization based networks [26,39,50,54,60,69]. Randomized neural networks for ensemble learning are also under-researched. It is widely accepted that ensemble methods benefit from low-bias, high-variance learners [64]. Thus, it is worth investigating how to integrate randomized neural networks into ensemble learning frameworks.
In [32], the authors investigated the approximation ability of standard multilayer feedforward networks with as few as a single hidden layer. They showed that when the activation function is bounded and nonconstant, standard multilayer feedforward networks are universal approximators. In [37], the authors showed that when the parameters of the hidden layer are randomly sampled from a uniform distribution within a proper range, the resulting neural network, the RVFL, is a universal approximator for continuous functions on bounded finite dimensional sets with an efficient convergence rate. However, how to optimize the range of the random parameters remains untouched in the literature. Hence, this is another area where further research is required. We should also investigate random parameter settings in other models, such as polynomial and wavelet networks.
"Big data" is a hot topic that has recently led to an upsurge of research. Deep learning [31] has been gaining popularity in the machine learning and computer vision communities. It has been shown that the high-level abstraction which comes from
deep structure is a key factor leading to the success of many state-of-the-art systems in vision, language, and other AI-
level tasks. Deep networks are much more difficult to train than shallow ones because they need a relatively large training
data set to tune a large number of parameters. Despite the surge in interest in deep networks as more large scale data
becomes available [22,35], the theoretical aspects have remained under-researched. In [42], the authors showed that deep
but narrow networks trained with a greedy layer-wise unsupervised learning algorithm do not require more parameters
than shallow ones to achieve universal approximation. However, when it comes to deep random neural networks, it remains unclear how to set the random parameters. Most importantly, the performance gap between deep random neural networks and deep neural networks trained with back-propagation is wide, in favor of back-propagation. Moreover, when it comes to big data or large-scale learning, a high dimensional feature space is usually preferred [17]. In this case, the commonly used primal-space approach for randomized neural networks, such as Eq. (3), whose complexity is between quadratic and cubic in the feature dimension, may become computationally intractable. Thus, one may alternatively turn to other approaches such as conjugate gradient methods [49]. On the other hand, solutions in the dual space [67] have a time complexity of $O(n^2 p)$ (where p is the number of features and n is the number of data samples), which can also be very slow. In this case, randomized approximation methods such as [48] and [61] can be more efficient and reliable.
9. Conclusion
In this article, we presented an extensive survey of randomized methods for training neural networks, the use of randomization in kernel machines, and related topics. We divided this family of methods into several groups based on the network configuration. We believe that this article, the first survey on randomized methods for training neural networks, offers valuable insights into this important research topic. We also offered several potential future research directions. We trust that this article will encourage further advancements in this field.
Acknowledgment
The authors wish to thank the Guest Editor, Associate Professor Dianhui Wang, and the reviewers for their valuable comments.
References
[1] D.J. Albers, Dynamical behavior of artificial neural networks with random weights, in: Intelligent Engineering Systems Through Artificial Neural Networks, ASME Digital Collection, 1996, pp. 17–22.
[2] M. Alhamdoosh, D. Wang, Fast decorrelated neural network ensembles with random weights, Inf. Sci. 264 (2014) 104–117.
[3] H. Bakırcıoğlu, T. Koçak, Survey of random neural network applications, Eur. J. Oper. Res. 126 (2) (2000) 319–330.
[4] R. Baraniuk, Compressive sensing, IEEE Signal Process. Mag. 24 (4) (2007).
[5] P.L. Barlett, T. Downs, Using random weights to train multilayer networks of hard-limiting units, IEEE Trans. Neural Netw. 3 (2) (1992) 202–210.
[6] P.L. Bartlett, S. Mendelson, Rademacher and gaussian complexities: Risk bounds and structural results, J. Mach. Learn. Res. 3 (2003) 463–482.
[7] Y. Bengio, Y. LeCun, et al., Scaling learning algorithms towards AI, Large-Scale Kernel Mach. 34 (5) (2007).
[8] J. Bergstra, Y. Bengio, Random search for hyper-parameter optimization, J. Mach. Learn. Res. 13 (1) (2012) 281–305.
[9] H. Berry, M. Quoy, Structure and dynamics of random recurrent neural networks, Adapt. Behav. 14 (2) (2006) 129–137.
[10] H.D. Block, The perceptron: A model for brain functioning. i, Rev. Mod. Phys. 34 (1962) 123–135.
[11] K.D. Boese, A.B. Kahng, Simulated annealing of neural networks: the 'cooling' strategy reconsidered, in: Proceedings of the IEEE International Symposium on Circuits and Systems, IEEE, 1993, pp. 2572–2575.
[12] D.S. Broomhead, D. Lowe, Radial basis functions, multi-variable functional interpolation and adaptive networks, Technical Report, DTIC Document,
1988.
[13] E.J. Candes, T. Tao, Decoding by linear programming, IEEE Trans. Inf. Theory 51 (12) (2005) 4203–4215.
[14] E.J. Candes, T. Tao, Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Trans. Inf. Theory 52 (12) (2006) 5406–
5425.
[15] C.P. Chen, A rapid supervised learning neural network for function interpolation and approximation, IEEE Trans. Neural Netw. 7 (5) (1996) 1220–1230.
[16] C.P. Chen, J.Z. Wan, A rapid learning and dynamic stepwise updating algorithm for flat neural networks and the application to time-series prediction,
IEEE Trans. Syst., Man, Cybern., Part B: Cybern. 29 (1) (1999) 62–72.
[17] D. Chen, X. Cao, F. Wen, J. Sun, Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification, in: Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2013, pp. 3025–3032.
[18] S.C. Chiam, K.C. Tan, A. Al Mamun, Multiobjective evolutionary neural networks for time series forecasting, in: Evolutionary Multi-Criterion Optimiza-
tion, Springer, 2007, pp. 346–360.
[19] A. Coates, A.Y. Ng, Selecting receptive fields in deep networks, in: Proceedings of the Advances in Neural Information Processing Systems, 2011,
pp. 2528–2536.
[20] B. Dai, B. Xie, N. He, Y. Liang, A. Raj, M.-F. F. Balcan, L. Song, Scalable kernel methods via doubly stochastic gradients, in: Proceedings of the Advances
in Neural Information Processing Systems, 2014, pp. 3041–3049.
[21] L.N. de Castro, F.J. Von Zuben, Immune and neural network models: theoretical and empirical comparisons, Int. J. Comput. Intell. Appl. 1 (03) (2001)
239–257.
[22] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR’09), IEEE, 2009, pp. 248–255.
[23] A. Elisseeff, H. Paugam-Moisy, JNN, a randomized algorithm for training multilayer networks in polynomial time, Neurocomputing 29 (1) (1999) 3–24.
[24] J. Engel, Teaching feed-forward neural networks by simulated annealing, Complex Syst. 2 (6) (1988) 641–648.
[25] J. Fan, W. Xu, Y. Wu, Y. Gong, Human tracking using convolutional neural networks, IEEE Trans. Neural Netw. 21 (10) (2010) 1610–1623.
[26] M. Fernández-Delgado, E. Cernadas, S. Barro, D. Amorim, Do we need hundreds of classifiers to solve real world classification problems? J. Mach.
Learn. Res. 15 (1) (2014) 3133–3181.
[27] E. Gelenbe, Random neural networks with negative and positive signals and product form solution, Neural Comput. 1 (4) (1989) 502–510.
[28] M. Georgiopoulos, C. Li, T. Kocak, Learning in the feed-forward random neural network: A critical review, Perform. Eval. 68 (4) (2011) 361–384.
[29] M. Gori, A. Tesi, On the problem of local minima in backpropagation, IEEE Trans. Pattern Anal. Mach. Intell. 14 (1) (1992) 76–86.
[30] S. Haykin, Neural Networks: A Comprehensive Foundation, 2004.
[31] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (5786) (2006) 504–507.
[32] K. Hornik, Approximation capabilities of multilayer feedforward networks, Neural Netw. 4 (2) (1991) 251–257.
[33] Y.-C. Hu, Functional-link nets with genetic-algorithm-based learning for robust nonlinear interval regression analysis, Neurocomputing 72 (7) (2009)
1808–1816.
[34] Y.-C. Hu, F.-M. Tseng, Functional-link net with fuzzy integral for bankruptcy prediction, Neurocomputing 70 (16) (2007) 2959–2968.
[35] G. Huang, M. Ramesh, T. Berg, E. Learned-Miller, Labeled faces in the wild: A database for studying face recognition in unconstrained environments,
Technical Report 07-49, University of Massachusetts, Amherst, MA, 2007.
[36] D. Husmeier, J.G. Taylor, Neural networks for predicting conditional probability densities: Improved training scheme combining EM and RVFL, Neural
Netw. 11 (1) (1998) 89–116.
[37] B. Igelnik, Y.-H. Pao, Stochastic choice of basis functions in adaptive function approximation and the functional-link net, IEEE Trans. Neural Netw. 6
(6) (1995) 1320–1329.
[38] R.A. Jacobs, Increased rates of convergence through learning rate adaptation, Neural Netw. 1 (4) (1988) 295–307.
[39] H. Jaeger, Adaptive nonlinear system identification with echo state networks, in: Proceedings of the Advances in Neural Information Processing Sys-
tems, 2002, pp. 593–600.
[40] K. Jarrett, K. Kavukcuoglu, M. Ranzato, Y. LeCun, What is the best multi-stage architecture for object recognition? in: Proceedings of IEEE 12th Inter-
national Conference on Computer Vision, IEEE, 2009, pp. 2146–2153.
[41] J. Kennedy, Particle swarm optimization, in: Encyclopedia of Machine Learning, Springer, 2010, pp. 760–766.
[42] N. Le Roux, Y. Bengio, Deep belief networks are compact universal approximators, Neural Comput. 22 (8) (2010) 2192–2207.
[43] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278–2324.
[44] H. Li, C.P. Chen, H.-P. Huang, Fuzzy Neural Intelligent Systems: Mathematical Foundation and the Applications in Engineering, CRC Press, 2010.
[45] P. Li, T.J. Hastie, K.W. Church, Very sparse random projections, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, ACM, 2006, pp. 287–296.
[46] W. Li, D. Wang, T. Chai, Multisource data ensemble modeling for clinker free lime content estimate in rotary kiln sintering processes, IEEE Trans. Syst.,
Man, Cybern.: Syst. 45 (2) (2015) 303–314.
[47] T.P. Lillicrap, D. Cownden, D.B. Tweed, C.J. Akerman, Random feedback weights support learning in deep neural networks, arXiv preprint arXiv:1411.0247 (2014).
[48] Y. Lu, P. Dhillon, D.P. Foster, L. Ungar, Faster ridge regression via the subsampled randomized hadamard transform, in: Proceedings of the Advances in
Neural Information Processing Systems, 2013, pp. 369–377.
[49] D.G. Luenberger, Introduction to Linear and Nonlinear Programming, 28, Addison-Wesley Reading, MA, 1973.
[50] W. Maass, T. Natschläger, H. Markram, Real-time computing without stable states: A new framework for neural computation based on perturbations,
Neural Comput. 14 (11) (2002) 2531–2560.
[51] R. Mendes, P. Cortez, M. Rocha, J. Neves, Particle swarms for feedforward neural network training, Learning 6 (1) (2002).
[52] J. Moody, C. Darken, Learning with Localized Receptive Fields, Yale Univ., Department of Computer Science, 1988.
[53] Y.-H. Pao, Adaptive Pattern Recognition and Neural Networks, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1989.
[54] Y.-H. Pao, G.-H. Park, D.J. Sobajic, Learning and generalization characteristics of the random vector functional-link net, Neurocomputing 6 (2) (1994)
163–180.
[55] Y.-H. Pao, S.M. Phillips, The functional link net and learning optimal control, Neurocomputing 9 (2) (1995) 149–164.
[56] Y.-H. Pao, Y. Takefuji, Functional-link net computing, IEEE Comput. 25 (5) (1992) 76–79.
[57] N. Pinto, D. Doukhan, J.J. DiCarlo, D.D. Cox, A high-throughput screening approach to discovering good forms of biologically inspired visual represen-
tation, PLoS Comput Biol 5 (11) (2009) e1000579, doi:10.1371/journal.pcbi.1000579.
[58] T. Poggio, F. Girosi, A theory of networks for approximation and learning, Technical Report, DTIC Document, 1989.
[59] T. Poggio, F. Girosi, Networks for approximation and learning, Proc. IEEE 78 (9) (1990) 1481–1497.
[60] J.C. Principe, B. Chen, Universal approximation with convex optimization: Gimmick or reality?[discussion forum], IEEE Comput. Intell. Mag. 10 (2)
(2015) 68–77.
[61] A. Rahimi, B. Recht, Random features for large-scale kernel machines, in: Proceedings of the Advances in Neural Information Processing Systems, 2007,
pp. 1177–1184.
[62] A. Rahimi, B. Recht, Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning, in: Proceedings of the Advances
in Neural Information Processing Systems, 2009, pp. 1313–1320.
[63] Y. Ren, P.N. Suganthan, N. Srikanth, G. Amaratungac, Random vector functional link network for short-term electricity load demand forecasting, Inf.
Sci. (2015), doi:10.1016/j.ins.2015.11.039.
[64] Y. Ren, L. Zhang, P.N. Suganthan, Ensemble classification and regression – recent developments, applications and future directions, IEEE Comput. Intell.
Mag. (2015), doi:10.1109/MCI.2015.2471235.
[65] F. Rosenblatt, The perceptron: a probabilistic model for information storage and organization in the brain., Psychol. Rev. 65 (6) (1958) 386.
[66] F. Rosenblatt, Principles of neurodynamics. perceptrons and the theory of brain mechanisms, Technical Report, DTIC Document, 1961.
[67] C. Saunders, A. Gammerman, V. Vovk, Ridge regression learning algorithm in dual variables, in: Proceedings of the 15th International Conference on
Machine Learning, Morgan Kaufmann, 1998, pp. 515–521.
[68] A. Saxe, P.W. Koh, Z. Chen, M. Bhand, B. Suresh, A.Y. Ng, On random weights and unsupervised feature learning, in: Proceedings of the 28th Interna-
tional Conference on Machine Learning (ICML-11), 2011, pp. 1089–1096.
[69] W.F. Schmidt, M.A. Kraaijveld, R.P. Duin, Feedforward neural networks with random weights, in: Proceedings of the 11th IAPR International Conference on Pattern Recognition, IEEE, 1992, pp. 1–4.
[70] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, J. Mach.
Learn. Res. 15 (1) (2014) 1929–1958.
[71] Y. Tang, R.R. Salakhutdinov, Learning stochastic feedforward neural networks, in: Proceedings of the Advances in Neural Information Processing Sys-
tems, 2013, pp. 530–538.
[72] S. Timotheou, The random neural network: a survey, Comput. J. 53 (3) (2010) 251–267.
[73] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P.-A. Manzagol, Stacked denoising autoencoders: Learning useful representations in a deep network with
a local denoising criterion, J. Mach. Learn. Res. 11 (2010) 3371–3408.
[74] L. Wan, M. Zeiler, S. Zhang, Y.L. Cun, R. Fergus, Regularization of neural networks using dropconnect, in: Proceedings of the 30th International Confer-
ence on Machine Learning (ICML-13), 2013, pp. 1058–1066.
[75] L. Wang, M. Courant, A novel neural network based on immunity., in: Proceedings of the IC-AI, Citeseer, 2002, pp. 147–153.
[76] J.P.T. Yusiong, P.C. Naval Jr, Training neural networks using multiobjective particle swarm optimization, in: Advances in Natural Computation, Springer,
2006, pp. 879–888.
[77] M.D. Zeiler, R. Fergus, Stochastic pooling for regularization of deep convolutional neural networks, in: Proceedings of the International Conference on
Learning Representations, 2013.
[78] L. Zhang, P.N. Suganthan, A comprehensive evaluation of random vector functional link networks, Inf. Sci. (2015), doi:10.1016/j.ins.2015.09.025.
[79] X. Zhang, Nonlinear dimensionality reduction of data by deep distributed random samplings, in: Proceedings of the Sixth Asian Conference on Machine
Learning, 2014.