Swarm Intelligence Methods for Statistical Regression
Soumya D. Mohanty
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher
cannot assume responsibility for the validity of all materials or the consequences of their use. The
authors and publishers have attempted to trace the copyright holders of all material reproduced in
this publication and apologize to copyright holders if permission to publish in this form has not
been obtained. If any copyright material has not been acknowledged, please write and let us know
so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.
copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC),
222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that
provides licenses and registration for a variety of users. For organizations that have been granted a
photocopy license by the CCC, a separate system of payment has been arranged.
Conventions and Notation
∀ : For all.
A ∝ B : A is proportional to B.
Σ_{k=i}^{j} : Summation of quantities indexed by integer k.
Π_{k=i}^{j} : Product of quantities indexed by integer k.
B : A set. Small sets will be shown explicitly, when needed, as {a, b, ...}. Otherwise, {a | C} will denote the set of elements for which the statement C is true. A set of indexed elements, such as {a_0, a_1, ...}, will be denoted by {a_i}, i = 0, 1, ..., where needed.
α ∈ A : α is an element of A.
N : An integer.
A × B : Direct product of A and B. It is the set {(x, y) | x ∈ A, y ∈ B}.
A² : The set A × A with each element, called a 2-tuple, of the form (α ∈ A, β ∈ A).
A^N : For N ≥ 2, the set A × A^{N−1} with A¹ = A. Each element, called an N-tuple, is of the form (α_0, α_1, ..., α_{N−1}) with α_i ∈ A ∀ i.
CHAPTER 1

Introduction
CONTENTS
1.1 Optimization in statistical analysis . . . . . . . . . . . . . . 1
1.2 Statistical analysis: Brief overview . . . . . . . . . . . . . . 3
1.3 Statistical regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.1 Parametric regression . . . . . . . . . . . . . . . . . . . . 6
1.3.2 Non-parametric regression . . . . . . . . . . . . . . . 8
1.4 Hypotheses testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.5.1 Noise in the independent variable . . . . . . . 15
1.5.2 Statistical analysis and machine learning 16
than one model. So, the decision must take into account the
variation of models within each set. We discuss hypothesis
testing in further detail in Sec. 1.4.
Methods for density estimation can be divided, rather
loosely, into parametric and non-parametric ones. This divi-
sion is essentially based on how strong the assumptions em-
bodied in the models of pZ are before the data is obtained.
If the set of models is prescribed independently of the data,
we get a purely parametric method. If pZ (z) is determined
from the data itself without strong prior assumptions, we get
a non-parametric method.
For example, deciding ahead of obtaining the data that
pZ (z) is a multivariate normal pdf (see App. A), with only
its parameters, namely the mean vector and/or the covari-
ance matrix, being unknown, yields an example of a para-
metric method. In contrast, inferring pZ (z) by making a (pos-
sibly multi-dimensional) histogram is an example of a non-
parametric method.
Strictly speaking, non-parametric methods are not
parameter-free since they do contain parameters, such as the
size and number of bins in the histogram example above. The
main distinction is that the assumptions made about pZ (z)
are much weaker in the non-parametric approach. It is best
to see parametric and non-parametric methods as lying at the
extreme ends of a continuum of approaches in statistical anal-
ysis.
The appearance of the parameters is more explicit in parametric analysis. This is often shown by using p_Z(z; θ), where θ = (θ_0, θ_1, ..., θ_{P−1}) is the set of parameters defining the models. For example, in the case of the multivariate normal pdf model (see Sec. A.6), θ may include the vector of mean values, or the covariance matrix, or both.
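As a minimal illustration of the two approaches, the following Python sketch estimates a univariate pdf from the same simulated sample in both ways: a parametric fit that assumes a normal model and estimates only its mean and standard deviation, and a non-parametric histogram whose bin count is the analyst's only choice. The simulated data, the bin count, and the use of numpy are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(42)
    z = rng.normal(loc=1.5, scale=0.7, size=1000)   # simulated data realization

    # Parametric method: assume p_Z(z; theta) is normal; estimate theta = (mu, sigma).
    mu_hat, sigma_hat = z.mean(), z.std(ddof=1)
    def p_parametric(x):
        return np.exp(-0.5 * ((x - mu_hat) / sigma_hat) ** 2) / (sigma_hat * np.sqrt(2 * np.pi))

    # Non-parametric method: histogram estimate of p_Z with no model assumed.
    counts, edges = np.histogram(z, bins=30, density=True)
    def p_histogram(x):
        idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(counts) - 1)
        return counts[idx]

    print(p_parametric(1.5), p_histogram(1.5))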
A real-world example of a density estimation problem is
finding groups in data consisting of the luminosity and tem-
perature of some set of stars in our Galaxy. (We are talking
here about the famous Hertzsprung-Russell diagram [50].) A
parametric approach to the above problem could be based on
Y = q_c(X; θ) + E ,   (1.5)

[Figure: two panels plotted against the independent variable; one shows a smooth curve and the other a noisy scatter of data points.]
a model that offers a very good fit but only to the given
data, namely an overfitted model, additional constraints are
imposed that restrict the choice of parameter values.
We will illustrate the application of SI methods to non-
parametric regression problems through the following exam-
ple.
Spline-based smoothing:
The regression model is Y = f (X) + E, where f (X) is only
assumed to be a smooth, but an otherwise unknown, func-
tion. One way to implement smoothness is to require that the
average squared curvature of f(X), defined as

(1/(b − a)) ∫_a^b (d²f/dx²)² dx ,   (1.9)
for x ∈ [a, b] be sufficiently small. It can be shown that the
best least-squares estimate of f (X) under this requirement
must be a cubic spline. (See App. B for a bare-bones review
of splines and the associated terminology.)
Thus, we will assume f(X) to be a cubic spline defined by a set of M breakpoints denoted by b = (b_0, b_1, ..., b_{M−1}), where b_{i+1} > b_i. The set of all cubic splines defined by the same b is a linear vector space. One of the basis sets of this vector space is that of B-spline functions B_{j,4}(x; b), j = 0, 1, ..., M − 1. (It is assumed here that the cubic splines decay to zero at X = b_0 and X = b_{M−1}.) It follows that f(X) is a linear combination of the B-splines given by

f(X) = Σ_{j=0}^{M−1} α_j B_{j,4}(X; b) ,   (1.10)
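A minimal sketch of Eq. 1.10 in Python, using scipy to construct cubic B-spline basis functions on a set of breakpoints and forming a spline as their linear combination. The end-knot padding convention and the coefficient values below are illustrative assumptions rather than the construction used in the text (which takes the splines to decay to zero at the end breakpoints).

    import numpy as np
    from scipy.interpolate import BSpline

    # Breakpoints defining the spline.
    b = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
    k = 3  # degree 3, i.e. order 4 as in the B_{j,4} notation

    # Pad the breakpoint sequence at both ends (a common convention) to get a
    # knot vector; this yields len(t) - k - 1 cubic B-spline basis functions.
    t = np.r_[[b[0]] * k, b, [b[-1]] * k]
    n_basis = len(t) - k - 1

    def bspline_basis(j, x):
        """Evaluate the j-th cubic B-spline basis function on the knots t."""
        coeff = np.zeros(n_basis)
        coeff[j] = 1.0
        return BSpline(t, coeff, k)(x)

    # A cubic spline is a linear combination of the basis functions (Eq. 1.10).
    alpha = np.linspace(1.0, 2.0, n_basis)          # illustrative coefficients
    x = np.linspace(b[0], b[-1], 200)
    f = sum(alpha[j] * bspline_basis(j, x) for j in range(n_basis))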
Figure 1.2 The top panel shows the function f(X) in Eq. 1.12 that is used for generating test data for the regression spline method. Here, X is equally spaced with x_i = iΔ, Δ = 1/512, and i = 0, 1, ..., 511, and f(X) = 10 B_{0,4}(x; c), with c = (0.3, 0.4, 0.45, 0.5, 0.55). The bottom panel shows a data realization.
1.5 NOTES
The brevity of the overview of statistical analysis presented in
this chapter demands incompleteness in many important re-
spects. Topics such as the specification of errors in parameter
estimation and obtaining confidence intervals instead of point
estimates are some of the important omissions. Also omitted
are some of the important results in the theoretical founda-
tions of statistics, such as the Cramer-Rao lower bound on
estimation errors and the RBLS theorem.
While the overview, despite these omissions, is adequate
for the rest of the book, the reader will greatly benefit from
perusing textbooks such as [26, 25, 15] for a firmer foundation
in statistical analysis. Much of the material in this chapter is
culled from these books. Non-parametric smoothing methods,
including spline smoothing, are covered in detail in [20].
Some important topics that were touched upon only briefly
in this chapter are elaborated upon a little more below.
that the model generalizes well to new data. Since the typical
applications of machine learning, such as object recognition in
images or the recognition of spoken words, involve inherently
complicated models for the joint pdf, a large amount of train-
ing data is required for successful learning. This requirement
has dovetailed nicely with the emergence of big data, which
has led to the availability of massive amounts of training data
related to real world tasks. It is no surprise, therefore, that
the field of machine learning has seen breathtaking advances
recently.
Deeper discussions of the overlap between statistical and
machine learning concepts can be found in [15, 19].
CHAPTER 2

Stochastic Optimization Theory
CONTENTS
2.1 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 Convex and non-convex optimization problems 21
2.3 Stochastic optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4 Exploration and exploitation . . . . . . . . . . . . . . . . . . . . 28
2.5 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.6 Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.7 BMR strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.8 Pseudo-random numbers and stochastic
optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.9 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.1 TERMINOLOGY
The objective in an optimization problem is to find the optimum (maximum or minimum) value of a fitness function f(x), x ∈ 𝒟 ⊆ R^D. The subset 𝒟 is called the search space for the optimization problem and D is the dimensionality of the search space. Alternative terms used in the literature are objective function and constraint space (or feasible set) for the fitness function and search space respectively. (Strictly speaking, the above definition is that of a continuous optimization problem. We do not consider discrete or combinatorial optimization problems, where x ∈ Z^D, in this book.)
The location, denoted as x∗ below, of the optimum is called
the optimizer (maximizer or minimizer depending on the type
of optimization problem). Since finding the maximizer of a
fitness function f (x) is completely equivalent to finding the
minimizer of −f (x), we only consider minimization problems
in the following.
Formally stated, a minimization problem consists of solving for x∗ such that f(x∗) ≤ f(x) for all x ∈ 𝒟. Equivalently, x∗ = arg min_{x ∈ 𝒟} f(x).
Generalized Griewank:

f(x) = (1/4000) Σ_{i=1}^{D} x_i² − Π_{i=1}^{D} cos(x_i/√i) + 1 .   (2.6)
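A direct transcription of Eq. 2.6 in Python, shown together with the commonly used form of the Rastrigin function; since Eq. 2.5 is not reproduced in this excerpt, the Rastrigin expression below should be treated as an assumption.

    import numpy as np

    def griewank(x):
        """Generalized Griewank fitness (Eq. 2.6); x is a length-D array."""
        x = np.asarray(x, dtype=float)
        i = np.arange(1, x.size + 1)
        return np.sum(x**2) / 4000.0 - np.prod(np.cos(x / np.sqrt(i))) + 1.0

    def rastrigin(x):
        """Commonly used Rastrigin form (assumed; compare Eq. 2.5 in the text)."""
        x = np.asarray(x, dtype=float)
        return np.sum(x**2 - 10.0 * np.cos(2.0 * np.pi * x) + 10.0)

    print(griewank(np.zeros(5)), rastrigin(np.zeros(5)))  # both have global minimum 0 at the origin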
over 𝒟. Taking all the points above this surface yields a “volume” in D + 1 dimensional space. If this volume is convex, then the function surface is said to be convex. The formal definition goes as follows: take the subset 𝒟_f of the Cartesian product set 𝒟 × R such that for each element (x, y) ∈ 𝒟_f, x ∈ 𝒟 and y ≥ f(x). If 𝒟_f is convex, then f(x) is a convex function. While we skip the formal proof here, it is easy to convince oneself that the definition of convexity of a function implies that f((1 − λ)a + λb) ≤ (1 − λ)f(a) + λf(b) for any a, b ∈ 𝒟 and λ ∈ [0, 1]. This inequality is an alternative definition of a convex function.

One of the principal theorems in optimization theory states that the global minimum of a convex function f(x) over a convex set 𝒟 is also a local minimum and that there is
only one local minimum. A non-convex function, on the other
hand, can have multiple local minima in the search space and
the global minimum need not coincide with any of the local
minima. The Rastrigin function is an example of a non-convex
fitness function. The optimization of a non-convex function is
a non-convex optimization problem.
The local minimizer of a convex function can be found
using the method of steepest descent: step along a sequence
of points in the search space such that at each point, the
step to the next point is directed opposite to the gradient of
the fitness function at that point. Since the gradient at the
local minimizer vanishes, the method will terminate once it
hits the local minimizer. Steepest descent is an example of
a deterministic optimization method since the same starting
point and step size always results in the same path to the local
minimizer.
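A minimal sketch of steepest descent with a fixed step size, applied to a simple convex quadratic fitness function; the step size, stopping rule, and the numerical approximation of the gradient are illustrative choices.

    import numpy as np

    def numerical_gradient(f, x, h=1e-6):
        """Central-difference approximation to the gradient of f at x."""
        grad = np.zeros_like(x)
        for d in range(x.size):
            e = np.zeros_like(x)
            e[d] = h
            grad[d] = (f(x + e) - f(x - e)) / (2.0 * h)
        return grad

    def steepest_descent(f, x0, step=0.1, tol=1e-8, max_iter=10000):
        """Step opposite the local gradient until it (nearly) vanishes."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = numerical_gradient(f, x)
            if np.linalg.norm(g) < tol:
                break
            x = x - step * g
        return x

    # A convex fitness function has a single local (= global) minimizer.
    convex_quadratic = lambda x: np.sum((x - 1.0) ** 2)
    print(steepest_descent(convex_quadratic, x0=np.array([5.0, -3.0])))  # close to (1, 1)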
For a non-convex optimization problem, the method of
steepest descent will only take us to some local minimizer.
When the local minima are spaced closely, steepest descent
will not move very far from its starting point before it hits a
local minimizer and terminates. Thus, steepest descent or any
similar deterministic optimization method is not useful for
locating the global minimizer in a non-convex optimization
problem.
• Repeat
That is, the probability of the final solution landing in the op-
timality region goes to unity asymptotically with the number
of iterations.
The following conditions are sufficient for the general stochastic optimization method to converge.

(b) In the randomization step, the trial value v[k] of the vector random variable V is obtained as follows. First, a trial value w is drawn for a vector random variable W that has a uniform joint pdf over 𝒟: this means that P(w ∈ S ⊂ 𝒟) = µ(S)/µ(𝒟) for any S ⊂ 𝒟, where µ(A) denotes the volume of A. Then, a local minimization method is started at w. The location returned by the local minimization is v[k].

(c) The pdf p_V from which v[k] is drawn is determined implicitly by the procedure in (b) above.

(d) The algorithm is initialized by drawing x[0] from the same joint pdf as that of W.
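Read literally, the randomization step in (b) amounts to a random-restart strategy: draw a uniform point in the search space, refine it with a local minimization, and retain the best location found so far. The sketch below is one possible reading of that description; the choice of scipy's Nelder-Mead minimizer and the search-space bounds are assumptions, and griewank refers to the function defined in the earlier sketch.

    import numpy as np
    from scipy.optimize import minimize

    def random_restart_search(f, bounds, n_iter=50, seed=0):
        """Draw w uniformly over the search space, run a local minimization
        started at w to get v[k], and retain the best solution found."""
        rng = np.random.default_rng(seed)
        lo, hi = np.array(bounds).T
        x_best, f_best = None, np.inf
        for _ in range(n_iter):
            w = rng.uniform(lo, hi)                       # uniform draw over the search space
            res = minimize(f, w, method="Nelder-Mead")    # local minimization -> v[k]
            if res.fun < f_best:
                x_best, f_best = res.x, res.fun
        return x_best, f_best

    # Example: two-dimensional Griewank over [-10, 10]^2 (bounds are illustrative).
    x_best, f_best = random_restart_search(griewank, bounds=[(-10, 10), (-10, 10)])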
While guaranteed to converge in the limit of infinite iter-
ations, the above algorithm is not a practical one. First, since
the number of iterations must always be finite in practice, con-
vergence need not happen. Secondly, the computational cost
of evaluating the fitness function is an important considera-
tion in the total number of iterations that can be performed.
This often pushes the number of iterations to be as small as possible, a requirement diametrically opposed to the condition for convergence stated above. Reducing the number of iterations necessarily means that an algorithm will not be able to visit every region of the search space.
2.5 BENCHMARKING
Since stochastic optimization methods are neither guaranteed
to converge nor perform equally well across all possible fitness
functions, it is extremely important to test a given stochas-
tic optimization method across a sufficiently diverse suite of
fitness functions to gauge the limits of its performance. This
process is called benchmarking.
Any measure of the performance of a stochastic optimiza-
tion method on a given fitness function is bound to be a ran-
dom variable. Hence, benchmarking and the comparison of
different methods must employ a statistical approach. This
requires running a given method on not only a sufficiently di-
verse set of fitness functions but also running it multiple times
on each one. The initialization of the method and the sequence
of random numbers should be statistically independent across
the runs.
To keep the computational cost of doing the resulting large
number of fitness function evaluations manageable, the fit-
ness functions used for benchmarking must be computation-
ally inexpensive to evaluate. At the same time, these func-
tions should provide enough of a challenge to the optimization
methods if benchmarking is to be useful. In particular, the fit-
ness functions should be extensible to an arbitrary number of
search space dimensions. Several such benchmark fitness func-
tions have been developed in the literature and an extensive
list can be found in Table I of [4]. The Rastrigin and Griewank
functions introduced earlier in Eq. 2.5 and Eq. 2.6 are in fact
two such benchmark fitness functions.
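In this spirit, a benchmarking study reduces to repeated, independently seeded runs on each benchmark fitness function, followed by summary statistics of the best fitness values found. The harness below is a minimal sketch along these lines; it reuses the random_restart_search and griewank functions from the earlier sketches, and the number of runs is an arbitrary choice.

    import numpy as np

    def benchmark(optimizer, fitness, bounds, n_runs=30):
        """Run `optimizer` n_runs times with independent random seeds and
        report summary statistics of the best fitness value found per run."""
        best_values = []
        for seed in range(n_runs):
            _, f_best = optimizer(fitness, bounds, seed=seed)
            best_values.append(f_best)
        best_values = np.array(best_values)
        return {"mean": best_values.mean(),
                "std": best_values.std(ddof=1),
                "median": np.median(best_values)}

    # Example: benchmark the random-restart search on the Griewank function.
    stats = benchmark(random_restart_search, griewank, bounds=[(-10, 10)] * 5)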
2.6 TUNING
While benchmarking is essential in the study and objective
comparison of stochastic optimization methods, it does not
guarantee, thanks to the NFL theorem, that the best per-
former on benchmark fitness functions will also work well in
the intended application area. At best, benchmarking estab-
lishes the order in which one should investigate the effective-
ness of a set of methods. Having arrived at an adequately per-
forming method, one must undertake further statistical char-
acterization in order to understand and improve its perfor-
mance on the problem of interest. We call this process, which
is carried out with the fitness functions pertaining to the ac-
tual application area, tuning.
To summarize, benchmarking is used for comparing opti-
mization methods or delimiting the class of fitness functions
that a method is good for. It is carried out with fitness func-
tions that are crafted to be computationally cheap but suf-
ficiently challenging. Tuning is the process of improving the
performance of a method on the actual problem of interest and
involves computing the actual fitness function (or surrogates).
A pitfall in tuning is over-tuning the performance of an
optimization method on one specific fitness function. This
is especially important for statistical regression problems be-
cause the fitness function, such as the least squares function in
Eq. 1.4, depends on the data realization and, hence, changes
every time the data realization changes. Thus, over-tuning the
performance for a particular data realization runs the danger
of worsening the performance of the method on other realizations.
2.9 NOTES
The mathematically rigorous approach to optimization theory
is well described in [46]. A comprehensive list of fitness func-
tions commonly used in the benchmarking of stochastic opti-
mization methods is provided in Table I of [4]. Deterministic
optimization methods are described in great detail in [34]. The
curse of dimensionality is discussed in the context of statisti-
cal analysis and machine learning in Ch. 2 of [16] and Ch. 5
of [19] respectively. Exploration and exploitation have been
encountered in many different optimization scenarios. A good
review can be found in [5]. The methodology for comparative
assessment of optimization algorithms (i.e., benchmarking) is
discussed critically in [2], which also includes references for the
BMR strategy. For an insightful study of PRNGs, the reader
may consult the classic text [29] by Knuth.
CHAPTER 3

Evolutionary Computation and Swarm Intelligence
CONTENTS
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 Evolutionary computation . . . . . . . . . . . . . . . . . . . . . . . 39
3.3 Swarm intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.1 OVERVIEW
The universe of stochastic optimization methods is vast and
continues to inflate all the time. However, it is possible to dis-
cern some well-established approaches in the literature, called
metaheuristics, that most stochastic optimization methods
can be grouped under. This classification is only useful as
an organizing principle that helps in comprehending the di-
versity of methods. There is a continuous flow of ideas be-
• Nature-inspired optimization:
3.4 NOTES
Methods under the random sampling metaheuristic are pop-
ular in Bayesian statistical analysis and a good description of
the latter can be found in [18, 38]. Randomized local opti-
mization methods play a key role in the training of artificial
neural networks: See [19]. A comprehensive review of EC and
SI algorithms, along with references to the original papers,
is provided in [14]. The journal IEEE Transactions on Evo-
lutionary Computation provides an excellent portal into the
CHAPTER 4

Particle Swarm Optimization
CONTENTS
4.1 Kinematics: Global-best PSO . . . . . . . . . . . . . . . . . . . . 46
4.2 Dynamics: Global-Best PSO . . . . . . . . . . . . . . . . . . . . . 48
4.2.1 Initialization and termination . . . . . . . . . . . . 49
4.2.2 Interpreting the velocity update rule . . . . 49
4.2.3 Importance of limiting particle velocity . 51
4.2.4 Importance of proper randomization . . . . 53
4.2.5 Role of inertia . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.6 Boundary condition . . . . . . . . . . . . . . . . . . . . . . 56
4.3 Kinematics: Local-best PSO . . . . . . . . . . . . . . . . . . . . . 56
4.4 Dynamics: Local-best PSO . . . . . . . . . . . . . . . . . . . . . . 58
4.5 Standardized coordinates . . . . . . . . . . . . . . . . . . . . . . . . 59
4.6 Recommended settings for regression problems . 60
4.7 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.7.1 Additional PSO variants . . . . . . . . . . . . . . . . . 61
4.7.2 Performance example . . . . . . . . . . . . . . . . . . . . 63
Symbol Description
N_part : The number of particles in the swarm.
x^(i)[k] : Position of the i-th particle at the k-th iteration.
v^(i)[k] : Velocity of the i-th particle at the k-th iteration.
p^(i)[k] : Best location found by the i-th particle (pbest).
g[k] : Best location found by the swarm (gbest).
altered as follows.
Velocity constriction:

v^(i)[k + 1] = K ( v^(i)[k] + R_1 c_1 F_c^(i) + R_2 c_2 F_s^(i) ) .   (4.8)

φ = c_1 + c_2 > 4 .   (4.10)
some random vector lying in the plane formed by F_c^(i) and F_s^(i). If F_s^(i) does not change in the next update, the particle position as well as the new F_{c,s}^(i) vectors will continue to lie in the same plane. Hence, the next updated velocity vector will continue to lie in this plane, and so on. In fact, as long as F_s^(i) stays the same, all future velocity vectors will only cause the particle to move in the initial plane. (When D > 3 and the starting velocity is not zero, the motion will be confined to a 3-dimensional subspace of the D-dimensional space.)

This confinement to a restricted region of the full search space leads to inefficient exploration and, hence, loss of performance. On the other hand, if one uses the matrices R_1 and R_2, the particle moves off the initial plane even when the starting velocity vector is zero. Fig. 4.4 illustrates the difference between using scalar and matrix random variables, clearly showing that proper randomization is the most important ingredient of the velocity update equation.
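A minimal sketch of one inertia-weight velocity and position update for a single particle, drawing an independent uniform random number for each component of the cognitive and social forces (the matrix case of Fig. 4.4). The parameter values w = 0.9 and c_1 = c_2 = 1.0 follow the figure caption; the rest of the setup is illustrative.

    import numpy as np

    rng = np.random.default_rng(7)

    def velocity_update(x, v, pbest, gbest, w=0.9, c1=1.0, c2=1.0):
        """One inertia-weight velocity update for a single particle.
        R1 and R2 act as diagonal random matrices: an independent uniform
        random number multiplies each component of the cognitive and social
        forces (the 'matrix' case of Fig. 4.4)."""
        Fc = pbest - x                 # cognitive force: pull towards pbest
        Fs = gbest - x                 # social force: pull towards gbest
        r1 = rng.uniform(size=x.size)  # one random number per component
        r2 = rng.uniform(size=x.size)
        return w * v + c1 * r1 * Fc + c2 * r2 * Fs

    # One kinematic step for a particle at the origin of a 3-dimensional search space.
    x = np.zeros(3)
    v = np.zeros(3)
    v_new = velocity_update(x, v, pbest=np.array([1.0, 0.0, 0.0]),
                            gbest=np.array([0.0, 1.0, 0.5]))
    x_new = x + v_new                  # position update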
Figure 4.4 A large number of trial outcomes from the probability distribution of v^(i)[k + 1] when v^(i)[k] = (0, 0, 0) and the random factors multiplying the social and cognitive forces in the velocity update equation are (a) scalars (i.e., same for all the velocity components), and (b) matrices (i.e., different random numbers for different velocity components). The search space in this illustration is 3-dimensional. The particle position x^(i)[k] is the origin and the red (green) line points from the origin to the pbest (gbest). (For this illustration, we set w = 0.9, c_1 = c_2 = 1.0.)
Figure 4.5 Graph representation of the ring topology with a
neighborhood size of 3 for a swarm consisting of 12 parti-
cles. Each node in this graph indicates a particle and an edge
connecting two nodes indicates that they belong to the same
neighborhood. Note that every particle in this topology be-
longs to three different neighborhoods (one of which is its
own).
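The neighborhood structure of Fig. 4.5 is easy to express in code: with a neighborhood size of 3, the neighborhood of a particle is the particle itself plus one particle on either side of it on the ring. The indexing convention below is an assumption made for illustration.

    def ring_neighborhood(i, n_particles, size=3):
        """Indices of the particles in the neighborhood of particle i for a ring
        topology; size=3 means the particle itself plus one neighbor on each side."""
        half = (size - 1) // 2
        return [(i + offset) % n_particles for offset in range(-half, half + 1)]

    # For a swarm of 12 particles, particle 0's neighborhood wraps around the ring.
    print(ring_neighborhood(0, 12))   # [11, 0, 1]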
4.7 NOTES
PSO was proposed by Kennedy and Eberhart in [27]. The im-
portant idea of an inertia weight was introduced in [43]. Ve-
locity constriction was obtained from an analytical study of a
simplified PSO in [8]. A text that goes deeper into the theo-
retical foundations of PSO is [7]. Performance under different
boundary conditions is studied in [39].
the prescribed values in Table 4.2 except for the linear decay
of inertia. For gbest PSO, the inertia decayed from 0.9 to 0.4
over 500 iterations, while the same drop in inertia happened
over 2000 iterations for lbest PSO. In both cases, the total
number of iterations at termination was 5000 and the inertia
remains constant over the remaining iterations once it reaches
its lowest value.
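The linear inertia decay just described translates into a small helper function; the sketch below reproduces the gbest settings quoted here (decay from 0.9 to 0.4 over the first 500 iterations, constant thereafter) and is a literal reading of that description rather than code from the text.

    def inertia_weight(k, w_start=0.9, w_end=0.4, decay_iters=500):
        """Linearly decaying inertia weight, held constant after decay_iters."""
        if k >= decay_iters:
            return w_end
        return w_start + (w_end - w_start) * k / decay_iters

    print(inertia_weight(0), inertia_weight(250), inertia_weight(5000))  # 0.9, 0.65, 0.4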
CHAPTER 5
PSO Applications
CONTENTS
5.1 General remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.1.1 Fitness function . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.1.2 Data simulation . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.1.3 Parametric degeneracy and noise . . . . . . . . 68
5.1.4 PSO variant and parameter settings . . . . . 70
5.2 Parametric regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2.1 Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3 Non-parametric regression . . . . . . . . . . . . . . . . . . . . . . . 77
5.3.1 Reparametrization in regression spline . . 78
5.3.2 Results: Fixed number of breakpoints . . . 81
5.3.3 Results: Variable number of breakpoints 82
5.4 Notes and Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.4.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
SNR = (1/σ) [ Σ_{i=0}^{N−1} f(x_i)² ]^{1/2} ,   (5.4)
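Eq. 5.4 transcribes directly into code; the signal samples and noise level below are placeholders used only to exercise the formula.

    import numpy as np

    def snr(f_values, sigma):
        """Signal-to-noise ratio of Eq. 5.4: (1/sigma) * sqrt(sum_i f(x_i)^2)."""
        return np.sqrt(np.sum(np.asarray(f_values) ** 2)) / sigma

    # Illustrative signal sampled at N = 512 equally spaced points, unit noise sigma.
    x = np.arange(512) / 512.0
    f_values = np.sin(2 * np.pi * 4 * x)      # placeholder signal, not the book's
    print(snr(f_values, sigma=1.0))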
Figure 5.1 Contour plot of the fitness function for the quadratic chirp regression problem in Sec. 1.3.1. Shown here is a 2-dimensional cross-section of the 3-dimensional search space. In panel (a), the data, y, is just a quadratic chirp, with a1 = 100, a2 = 20, a3 = 10, without any added noise. In panel (b), there is noise present in y along with the quadratic chirp from (a). Areas with white color contain local minima. The location of the true parameters is marked by ‘+’, while ‘◦’ marks the global minimizer. The global minimizer coincides with the true parameters in (a) but not in (b). The fitness function is shown on a logarithmic scale because the local minima are not as prominent as the global minimum on a linear scale.
Figure 5.2 Scatterplot of f_opt and f_true for N_tune = 100 realizations of data containing the quadratic chirp. The panels correspond to PSO with (a) N_runs = 2, N_iter = 50, and (b) N_runs = 8, N_iter = 1000. The minimal performance condition is satisfied when a point lies above the straight line representing f_opt = f_true. For (a), M(2, 50) = 0.07, while M(8, 1000) = 1.0 for (b). The same data realizations are used in both (a) and (b). The true signal parameters are a1 = 100, a2 = 20, a3 = 10 and the SNR is 10.
5.2.2 Results
Given that the computational cost of running PSO on the
parametric regression example is quite low, we skip the metric-
based tuning and validation step, leaving it as an exercise for
the reader, and simply adopt the highest settings, Nruns = 8
and Niter = 1000 (see Fig. 5.2). The results obtained with
this setting and presented below serve as a reference for com-
parison when reproducing the performance of PSO on this
example.
For each data realization, the minimizer found by PSO
gives us the estimated values of the true signal parameters in
that realization. Due to the presence of noise, the estimated
value of a parameter is a random variable. Fig. 5.3 shows the
[Figures: histograms and pairwise scatterplots of the estimated values of a1, a2, and a3 across the data realizations.]

[Figure: the true signal overlaid with cardinal spline and regression spline estimates.]
AIC = 2P − 2L̂_P(T) .   (5.11)
L_S^R = L_S + λR(α) ,   (5.13)
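A minimal reading of Eq. 5.13 for the spline model is sketched below, with the least-squares term written as in Eq. C.9 and, purely as an assumed example, R(α) taken to be the sum of squared B-spline coefficients; the actual form of the regularization term is not reproduced in this excerpt.

    import numpy as np

    def penalized_least_squares(y, B, alpha, lam, sigma=1.0):
        """Regularized fitness L_S^R = L_S + lambda * R(alpha), with L_S the
        least-squares function of the spline model y ~ alpha @ B and, as an
        assumed example only, R(alpha) = sum(alpha**2)."""
        residual = y - alpha @ B          # B has one row per B-spline basis function
        L_S = np.sum(residual**2) / (2.0 * sigma**2)
        R = np.sum(alpha**2)
        return L_S + lam * R

    # Example with random stand-ins for the data and the B-spline matrix.
    rng = np.random.default_rng(1)
    B = rng.normal(size=(6, 128))         # stand-in for the B-spline values B_{ji}
    y = rng.normal(size=128)              # one data realization
    print(penalized_least_squares(y, B, alpha=np.zeros(6), lam=5.0))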
Figure 5.6 The estimated signals from (a) the regularized re-
gression spline method, and (b) cardinal spline fit for 100 data
realizations. The true signal in all the realizations is the one
shown in black. It follows Eq. 1.12 and has SNR = 10. All
the estimated signals are plotted together in gray. Model se-
lection was used in both cases with the number of breakpoints
K ∈ {5, 6, 7, 8, 9}. The regulator gain in the case of the regres-
sion spline method was set at λ = 5.0. For the cardinal spline
fit, λ = 0.
5.4.1 Summary
In this concluding chapter we have shown how an SI method
such as PSO can help open up new possibilities in statisti-
cal analysis. The computational cost of minimizing the fitness
function can become infeasible even for a small search space
dimensionality when a grid-based strategy is used. Tradition-
ally, the only option for statistical analysts in such situations
has been to use methods under the random sampling meta-
heuristic, which are not only computationally expensive and
wasteful in terms of fitness evaluations, but also fairly cum-
bersome to tune. This has prevented many a novice user of
statistical methods from venturing too far from linear mod-
els in their analysis since these lead to convex optimization
problems that are easily solved. However, non-linear models
are becoming increasingly prevalent and unavoidable across
many application areas as the problems being tackled become
progressively more complex.
Using a three-dimensional non-linear parametric regression
example, we showed that PSO could easily solve for the best
fit model with very small computational cost (in terms of the
number of fitness evaluations) while admitting a fairly robust
and straightforward tuning process. The example we chose
was deliberately kept simple in many ways in order to let the
reader reproduce it easily. However, PSO has already been
APPENDIX A

Primer on Probability Theory
CONTENTS
A.1 Random variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
A.2 Probability measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
A.3 Joint probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
A.4 Continuous random variables . . . . . . . . . . . . . . . . . . . . 94
A.5 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
A.6 Common probability density functions . . . . . . . . . . 98
• P (S) = 1.
¹ The mathematically rigorous approach requires that, before a probability measure is assigned to subsets, the subsets be chosen to form a Borel algebra in order to guarantee that a countable union or intersection of events is an event and the empty set is also an event.
of X1 ∈ A given X2 ∈ B.
From Eq. A.3 and Eq. A.4, we get
From the joint pdf, one can derive the pdf of any one of the
variables by marginalization.
P_1([a, b]) = ∫_a^b p_X(x) dx = P_12([a, b] × R)
            = ∫_a^b dx ∫_{−∞}^{∞} dy p_XY(x, y) ,

⇒ p_X(x) = ∫_{−∞}^{∞} dy p_XY(x, y) .   (A.12)
denominator is p_Y(y) and the numerator is ∫_A p_XY(x, y) dx, which gives

P_1|2(A | [y, y + ε)) = ∫_A (p_XY(x, y)/p_Y(y)) dx .   (A.14)

This motivates the definition of the conditional pdf,

p_X|Y(x|y) = p_XY(x, y)/p_Y(y) ,   (A.15)

giving P_1|2(A | [y, y + ε)) = ∫_A dx p_X|Y(x|y).
All the above definitions for two continuous random variables extend easily to N random variables X = (X_0, X_1, ..., X_{N−1}).

Joint pdf: p_X(x), defined by

P_12...N(A) = ∫_A d^N x p_X(x) .   (A.16)
A.5 EXPECTATION
With the probabilistic description of a continuous random
variable in hand, we can introduce some additional useful con-
cepts.
Expectation: E[f(X)], where f(x) is a function, defined by

E[f(X)] = ∫_{−∞}^{∞} dx f(x) p_X(x) .   (A.19)
p_X(x; µ, σ) = (1/(√(2π) σ)) exp( −(x − µ)²/(2σ²) ) ,   (A.22)

pdf,

p_X(x; a, b) = 1/(b − a) for a ≤ x ≤ b, and 0 otherwise .   (A.23)
The normal pdf is common enough that a special symbol,
N (x; µ, σ), is often used to denote it. The symbol used for the
uniform pdf is U (x; a, b). (A random variable with a normal or
uniform pdf is said to be normally or uniformly distributed.)
Two random variables X_0 and X_1 are said to have a bivariate normal pdf if

p_XY(x, y) = (1/(2π |C|^{1/2})) exp( −(1/2) ‖x − µ‖² ) ,   (A.24)

where for any vector z ∈ R²,

‖z‖² = (z_0, z_1) C^{−1} (z_0, z_1)^T ,   (A.25)
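The bivariate normal pdf of Eq. A.24 can be evaluated directly or, equivalently, with scipy.stats.multivariate_normal; the mean vector and covariance matrix below are arbitrary illustrative values.

    import numpy as np
    from scipy.stats import multivariate_normal

    mu = np.array([1.0, -0.5])                    # illustrative mean vector
    C = np.array([[1.0, 0.3],
                  [0.3, 2.0]])                    # illustrative covariance matrix

    def bivariate_normal_pdf(x, mu, C):
        """Direct evaluation of Eq. A.24 for a 2-dimensional point x."""
        d = x - mu
        quad = d @ np.linalg.inv(C) @ d           # ||x - mu||^2 of Eq. A.25
        return np.exp(-0.5 * quad) / (2.0 * np.pi * np.sqrt(np.linalg.det(C)))

    x = np.array([0.5, 0.0])
    print(bivariate_normal_pdf(x, mu, C))
    print(multivariate_normal(mean=mu, cov=C).pdf(x))   # same value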
APPENDIX B

Splines
CONTENTS
B.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
B.2 B-spline basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
B.1 DEFINITION
Consider data that is in the form {(b_i, y_i)}, i = 0, 1, ..., M − 1, where b_{i+1} > b_i are called breakpoints. A spline, f(x), is a piece-wise polynomial function defined over the interval x ∈ [b_0, b_{M−1}] such that for j = 1, ..., M − 1,
For 2 ≤ k′ ≤ k,

B_{i,k′}(x; b) = ((x − τ_i)/(τ_{i+k′−1} − τ_i)) B_{i,k′−1}(x; b) + ((τ_{i+k′} − x)/(τ_{i+k′} − τ_{i+1})) B_{i+1,k′−1}(x; b) .   (B.8)
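The recursion of Eq. B.8 translates directly into code once a base case is supplied; the usual starting point, an order-1 B-spline equal to the indicator function of [τ_i, τ_{i+1}), is assumed here since it is not part of the excerpt above.

    import numpy as np

    def bspline(i, k, x, tau):
        """B_{i,k}(x) on the knot sequence tau via the recursion of Eq. B.8.
        The base case (k = 1: indicator of [tau_i, tau_{i+1})) is assumed."""
        if k == 1:
            return np.where((tau[i] <= x) & (x < tau[i + 1]), 1.0, 0.0)
        left = right = 0.0
        if tau[i + k - 1] > tau[i]:
            left = (x - tau[i]) / (tau[i + k - 1] - tau[i]) * bspline(i, k - 1, x, tau)
        if tau[i + k] > tau[i + 1]:
            right = (tau[i + k] - x) / (tau[i + k] - tau[i + 1]) * bspline(i + 1, k - 1, x, tau)
        return left + right

    tau = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
    x = np.linspace(0.0, 1.0, 5)
    print(bspline(0, 4, x, tau))     # cubic (order 4) B-spline on these knots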
APPENDIX C

Analytical Minimization
CONTENTS
C.1 Quadratic chirp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
C.2 Spline-based smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . 108
For x ∈ R^N and y ∈ R^N.
L_S = (1/(2σ²)) ‖y − αB‖² ,   (C.9)
    = (1/(2σ²)) [ ‖y‖² − 2yBᵀαᵀ + αBBᵀαᵀ ] ,   (C.10)

B_{ji} = B_{j,4}[i] .   (C.11)

Let α̂ be the minimizer of L_S over α keeping the other parameters fixed. Using the standard condition for the extremum,

∂L_S/∂α_i |_{α = α̂} = 0   ⇒   α̂ = yBᵀ(BBᵀ)⁻¹ .   (C.12)

Substituting α̂ into Eq. C.10 gives the fitness function l_S,

l_S = (1/(2σ²)) [ ‖y‖² − yBᵀ(BBᵀ)⁻¹Byᵀ ] ,   (C.13)

that is to be minimized over the remaining parameters.
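The closed-form solution of Eq. C.12 is easy to verify numerically; the sketch below uses a random matrix purely as a stand-in for the B-spline values B_{ji} = B_{j,4}[i] and checks that α̂ = yBᵀ(BBᵀ)⁻¹ matches a generic least-squares solver.

    import numpy as np

    rng = np.random.default_rng(3)
    M, N, sigma = 6, 128, 1.0
    B = rng.normal(size=(M, N))          # stand-in for B_{ji} = B_{j,4}[i]
    y = rng.normal(size=N)               # one data realization (row vector)

    # Eq. C.12: analytic minimizer of L_S over the coefficients alpha.
    alpha_hat = y @ B.T @ np.linalg.inv(B @ B.T)

    # Eq. C.13: fitness after the coefficients have been eliminated.
    l_S = (y @ y - y @ B.T @ np.linalg.inv(B @ B.T) @ B @ y) / (2.0 * sigma**2)

    # Cross-check against a generic least-squares solver.
    alpha_lstsq, *_ = np.linalg.lstsq(B.T, y, rcond=None)
    assert np.allclose(alpha_hat, alpha_lstsq)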
Bibliography