Image Analysis and Pattern Recognition For Remote Sensing With Algorithms in ENVI/IDL
Image Analysis and Pattern Recognition For Remote Sensing With Algorithms in ENVI/IDL
Image Analysis and Pattern Recognition For Remote Sensing With Algorithms in ENVI/IDL
Contents
1 Images, Arrays and Vectors
1.1 Multispectral satellite images .
1.2 Algebra of vectors and matrices
1.3 Eigenvalues and eigenvectors .
1.4 Finding minima and maxima .
2 Image Statistics
2.1 Random variables . . . .
2.2 The normal distribution
2.3 A special function . . .
2.4 Conditional probabilities
Theorem . . . . . . . . .
2.5 Linear regression . . . .
.
.
.
.
.
.
.
.
.
.
.
.
. . . . . . .
. . . . . . .
. . . . . . .
and Bayes
. . . . . . .
. . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
1
1
4
6
8
13
. . . . . . . . . . . . . . . . . . . . . . . 13
. . . . . . . . . . . . . . . . . . . . . . . 14
. . . . . . . . . . . . . . . . . . . . . . . 16
. . . . . . . . . . . . . . . . . . . . . . . 17
. . . . . . . . . . . . . . . . . . . . . . . 18
3 Transformations
3.1 Fourier transforms . . . . . . . . . . . . . . .
3.1.1 Discrete Fourier transform . . . . . . .
3.1.2 Discrete Fourier transform of an image
3.2 Wavelets . . . . . . . . . . . . . . . . . . . . .
3.3 Principal components . . . . . . . . . . . . .
3.4 Minimum noise fraction . . . . . . . . . . . .
3.5 Maximum autocorrelation factor (MAF) . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
21
21
22
23
23
24
25
28
4 Radiometric enhancement
4.1 Lookup tables . . . . . . . . . . . .
4.1.1 Histogram equalization . .
4.1.2 Histogram matching . . . .
4.2 Convolutions . . . . . . . . . . . .
4.2.1 Laplacian of Gaussian filter
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
31
31
32
32
33
34
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
5 Topographic modelling
39
5.1 RST transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
i
ii
CONTENTS
5.2
5.3
5.4
5.5
5.6
Imaging transformations . . . . . . . . . .
Camera models and RFM approximations
Stereo imaging, elevation models and
orthorectification . . . . . . . . . . . . . .
Slope and aspect . . . . . . . . . . . . . .
Illumination correction . . . . . . . . . . .
6 Image Registration
6.1 Frequency domain registration
6.2 Feature matching . . . . . . . .
6.2.1 Contour detection . . .
6.2.2 Closed contours . . . . .
6.2.3 Chain codes . . . . . . .
6.2.4 Invariant moments . . .
6.2.5 Contour matching . . .
6.2.6 Consistency check . . .
6.3 Re-sampling and warping . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
7 Image Sharpening
7.1 HSV fusion . . . . . . . . . . . . .
7.2 Brovey fusion . . . . . . . . . . . .
7.3 PCA fusion . . . . . . . . . . . . .
7.4 Wavelet fusion . . . . . . . . . . .
7.4.1 Discrete wavelet transform
` trous filtering . . . . . .
7.4.2 A
7.5 Quality indices . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
. . . . . . . . . . . . . . . . . . . . 40
. . . . . . . . . . . . . . . . . . . . 41
. . . . . . . . . . . . . . . . . . . . 44
. . . . . . . . . . . . . . . . . . . . 50
. . . . . . . . . . . . . . . . . . . . 51
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
8 Change Detection
8.1 Algebraic methods . . . . . . . . . . . . . . . . . . . . . . . .
8.2 Principal components . . . . . . . . . . . . . . . . . . . . . .
8.3 Post-classification comparison . . . . . . . . . . . . . . . . . .
8.4 Multivariate alteration detection . . . . . . . . . . . . . . . .
8.4.1 Canonical correlation analysis . . . . . . . . . . . . . .
8.4.2 Solution by Cholesky factorization . . . . . . . . . . .
8.4.3 Properties of the MAD components . . . . . . . . . .
8.4.4 Covariance of MAD variates with original observations
8.4.5 Scale invariance . . . . . . . . . . . . . . . . . . . . . .
8.4.6 Improving signal to noise . . . . . . . . . . . . . . . .
8.4.7 Decision thresholds . . . . . . . . . . . . . . . . . . . .
8.5 Radiometric normalization . . . . . . . . . . . . . . . . . . . .
9 Unsupervised Classification
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
53
53
55
56
56
56
56
57
57
58
.
.
.
.
.
.
.
61
61
63
63
64
64
65
66
.
.
.
.
.
.
.
.
.
.
.
.
69
69
70
70
71
71
72
73
74
74
75
75
77
79
CONTENTS
iii
9.1
9.2
9.3
9.2.1
K-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
9.2.2
Extended K-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
9.2.3
9.2.4
Fuzzy K-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
EM Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
9.3.1
Simulated annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
9.3.2
Partition density . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
9.3.3
9.4
9.5
10 Supervised Classification
93
117
125
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
131
iv
CONTENTS
B.1
B.2
B.3
B.4
B.5
B.6
B.7
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
131
131
137
138
140
141
143
.
.
.
.
.
.
.
.
.
.
151
. 151
. 152
. 155
. 156
. 156
. 157
. 160
. 163
. 164
. 165
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
171
. 171
. 172
. 172
. 173
. 175
. 177
. 177
. 179
. 182
. 184
. 184
. 186
. 187
. 189
. 189
. 190
. 192
. 194
. 196
. 197
CONTENTS
203
vi
CONTENTS
Chapter 1
There are a number of multispectral satellite-based sensors currently in orbit which are used
for earth observation. Representative of these we mention here the Landsat ETM+ system.
The ETM+ instrument on the Landsat 7 spacecraft contains sensors to measure radiance
in three spectral intervals:
visible and near infrared (VNIR) bands - bands 1,2,3,4, and 8 (PAN) with a spectral
range between 0.4 and 1.0 micrometer.
short wavelength infrared (SWIR) bands - bands 5 and 7 with a spectral range between
1.0 and 3.0 micrometer.
thermal long wavelength infrared (LWIR) band - band 6 with a spectral range between
8.0 and 12.0 micrometer.
In addition a panchromatic (PAN) image (band 8) covering the visible spectrum is provided.
Ground resolutions are 15m (PAN), 30m (VNIR,SWIR) and 60m (LWIR). Figure 1.1 shows
a color composite image of a Landsat 7 scene over Morocco acquired in 1999.
A single multispectral image can be represented as an array of gray-scale values or digital
numbers
gk (i, j), 1 i c, 1 j r,
where c is the number of pixel columns and r is the number of pixel rows. If we are dealing
with an N -band multispectral image, then the index k, 1 k N , denotes the spectral
band. Often a pixel intensity is stored in a single byte, so that 0 gk 255.
The gray-scale values are the result of sampling along an array of sensors the at-sensor
radiance f (x, y) at wavelength due to sunlight reflected from some point (x, y) on the
Earths surface and focussed by the satellites optical system at the sensors. Ignoring atmospheric effects this radiance is given roughly by
f (x, y) i (x, y)r (x, y),
where i (x, y) is the suns irradiance at the surface in units of watt/m2 m, and r (x, y)
is the surface reflectance, a number between 0 and 1. The conversion between gray-scale
1
Figure 1.1: Color composite of bands 4 (red), 5 (green) and 7 (blue) for a Landsat ETM+
image over Morocco.
g2 (1, 1)
g2 (1, 2)
g2 (1, 3)
g1 (2, 1)
g1 (2, 2)
g1 (2, 3)
g2 (2, 1)
g2 (2, 2)
g2 (2, 3)
g1 (3, 1)
g1 (3, 2)
g1 (3, 3)
g2 (3, 1)
g2 (3, 2)
g2 (3, 3),
g1 (2, 1)
g1 (2, 2)
g2 (2, 3)
g1 (3, 1)
g1 (3, 2)
g1 (3, 3)
g2 (1, 1)
g2 (2, 1)
g2 (3, 1)
g2 (2, 1)
g1 (2, 2)
g1 (2, 3)
g2 (3, 1)
g2 (2, 3)
g2 (3, 3),
g1 (2, 1)
g1 (2, 2)
g1 (2, 3)
g2 (2, 1)
g2 (2, 2)
g2 (2, 3)
g1 (3, 1)
g1 (3, 2)
g1 (3, 3)
g2 (3, 1)
g2 (3, 2)
g2 (3, 3).
In the computer language IDL, so-called row major indexing is used for arrays and the
elements in an array are numbered from zero. This means that, if a gray-scale image g is
stored in an IDL array variable G, then the intensity value g(i, j) is addressed as G[i-1,j-1].
An N -band multispectral image is stored in BIP format as an N c r array in IDL, in
BIL format as a c N r and in BSQ format as an c r N array.
Auxiliary information, such as image acquisition parameters and georeferencing, is normally included with the image data on the same file, and the format may or may not make
use of compression algorithms. Examples are the geoTIFF1 file format used for example by
Space Imaging Inc. for distributing Carterra(c) imagery and which includes lossless compression, the HDF (Hierachical Data Format) in which for example ASTER images are distributed
and the cross-platform PCDSK format employed by PCI Geomatics with its image processing software, which is in plain ASCII code and not compressed. ENVI uses a simple flat
binary file structure with an additional ASCII header file.
1 geoTIFF refers to TIFF files which have geographic (or cartographic) data embedded as tags within the
TIFF file. The geographic data can then be used to position the image in the correct location and geometry
on the screen of a geographic information display.
1.2
g1 (i, j)
..
g(i, j) =
(1.1)
,
.
gN (i, j)
which is a column vector of multispectral gray-scale values at the position (i, j).
Since we will be making extensive use of the vector notation of Eq. (1.1) we review
here some of the basic properties of vectors and matrices. We can illustrate most of these
properties in just two dimensions.
x2
x1
Figure 1.2: A vector in two dimensions.
y1
y2
= x1 y1 + x2 y2 .
q
x21 + x22 =
x> x .
The programming language IDL is especially good at manipulating vectors and matrices:
IDL> x=[[1],[2]]
IDL> print,x
1
2
IDL> print,transpose(x)
1
2
>
x cos
The inner product can be written in terms of the vector lengths and the angle between
the two vectors as
x> y = |x||y| cos = xy cos ,
see Fig. 1.3. If = 90o the vectors are orthogonal so that
x> y = 0.
Any vector can be decomposed into orthogonal unit vectors:
x1
1
0
x=
= x1
+ x2
.
0
1
x2
A two-by-two matrix is written
A=
a11
a21
a12
a22
.
When a matrix is multiplied with a vector the result is another vector, e.g.
a11 a12
x1
a11 x1 + a12 x2
Ax =
=
.
a21 a22
x2
a21 x1 + a22 x2
The IDL operator for matrix and vector multiplication is ##.
IDL> a=[[1,2],[3,4]]
IDL> print,a
1
2
3
4
IDL> print,a##x
5
11
Matrices also have a transposed form, obtained by interchanging their rows and columns:
a11 a21
>
A =
.
a12 a22
The product of two matrices is given by
a11 a12
b11
AB =
a21 a22
b21
b12
b22
=
y2
0
=
x1 y 1
x2 y 1
x1 y2
x2 y2
1
0
0
1
,
IA = AI = A.
1
|A|
a22
a21
a12
a11
.
1.3
The statistical properties of ensembles of pixel intensities (for example entire images or
specific land-cover classes) are often approximated by their mean values and covariance
matrices. As we will see later, covariance matrices are always symmetric. A matrix A is
symmetric if it doesnt change when it is transposed, i.e. if
A = A> .
Very often we have to solve the so-called eigenvalue problem, which is to find eigenvectors x
and eigenvalues that satisfy the equation
Ax = x
or, equivalently,
a11
a21
a12
a22
x1
x2
=
x1
x2
.
(1.2)
which is known as the characteristic equation for the eigenvalue problem. It is a quadratic
equation in with solutions
q
1
(1) =
a11 + a22 + (a11 + a22 )2 4(a11 a22 a212 )
2
(1.3)
q
1
(2) =
a11 + a22 (a11 + a22 )2 4(a11 a22 a212 ) .
2
Thus there are two eigenvalues and, correspondingly, two eigenvectors x(1) and x(2) , which
can be obtained by substituting (1) and (2) into (1.2) and solving for x1 and x2 . It is easy
to show that the eigenvalues are orthogonal
(x(1) )> x(2) = 0.
The matrix formed by the two eigenvectors,
u = (x
(1)
(2)
,x
)=
u> Au =
0
(1)
x1
(1)
x2
0
(2)
(2)
x1
(2)
x2
,
(1.4)
IDL> a=float([[1,2],[2,3]])
IDL> print,a
1.00000
2.00000
2.00000
3.00000
IDL> print,eigenql(a,eigenvectors=u,/double)
4.2360680
-0.23606798
IDL> print,transpose(u)##a##u
4.2360680 -2.2204460e-016
-1.6653345e-016
-0.23606798
Note that, after diagonalization, the off-diagonal elements are not precisely zero due to
rounding errors in the computation.
All of the above properties generalize easily to N dimensions.
1.4
1
0
=
+
.
0 x1
1 x2
x
Many of the operations with vector derivatives correspond exactly to operations with ordinary scalar derivatives (They can all be verified easily by writing out the expressions
component-by component):
>
(x y) = y
x
analogous to
xy = y
x
>
(x x) = 2x
x
analogous to
2
x = 2x
x
x> Ay,
>
(x Ax) = Ax + A> x.
x
Note that, if A is a symmetrix matrix, this last equation can be written
>
(x Ax) = 2Ax.
x
Suppose x is a critical point of the function f (x), i.e.
d
d
f (x ) = f (x)
= 0,
dx
d
x=x
(1.5)
f (x)
x
d
dx f (x )
=0
x
d2
dx2 f (x )
d
d2
f (x ) + (x x )2 2 f (x ) + . . . .
dx
dx
d2
f (x ).
dx2
f (x ) 1
+ (x x )> H(x x ).
x
2
(1.6)
2
f (x ).
xi xj
f (x )
x
(1.7)
for all x 6= 0.
(1.8)
Suppose we want to find a minimum (or maximum) of a scalar function f (x) of the
vector x. If there are no constraints, then we solve the set of equations
f (x)
= 0,
xi
i = 1, 2,
10
(f (x) + g(x)) = 0,
xi
(f (x) + g(x)) = 0.
i = 1, 2
(1.10)
(f (x) + g(x)) = x1 + x2 1 = 0
The solution is
b
a
, x2 =
.
x1 =
a+b
a+b
11
Exercises
1. Show that the outer product of two 2-dimensional vectors is a singular matrix.
2. Prove that the eigenvectors or a 2 2 symmetric matrix are orthogonal.
3. Differentiate the function
1
(x a y)
with respect to y.
4. Verify the following matrix identity in IDL:
(A B)> = B> A> .
5. Calculate the eigenvalues and eigenvectors of a non-symmetric matrix with IDL.
6. Plot the function f (x) = x21 x22 with IDL. Find its minima and maxima subject to
the constraint g(x) = x21 + x22 1 = 0.
12
Chapter 2
Image Statistics
It is useful to think of image pixel intensities g(x) as realizations of a random vector G(x)
drawn independently from some probability distribution.
2.1
Random variables
A random variable can be used to represent some quantity which changes in an unpredictable
way each time it is observed. If there is a discrete set of M possible events {Ei }, i = 1 . . . M ,
associated with some random process, let pi be the probability that the ith event Ei will
occur. If ni represents the number of times Ei occurs in n trials, we expect that pi ni /n
in the limit n and that
M
X
pi = 1.
i=1
i = 1 . . . 36.
14
For continuous random variables, such as the measured radiance at a satellite sensor, the
distribution function is not expressed in terms of discrete probabilities, but rather in terms
of a probability density function p(x), where p(x)dx is the probability that the value of the
random variable X lies in the interval [x, x + dx]. Then
Z x
P (x) = Pr(X x) =
p(t)dt
and, of course,
P () = 1.
P () = 0,
The variance of X, written var(X) is defined as the expected value of the random variable
(X hXi)2 , i.e.
var(X) = (X hXi)2 .
In terms of the probability density function, it is given by
Z
var(X) =
(x hXi)2 p(x)dx.
Two simple but very useful identities follow from the definition of variance:
var(X) = hX 2 i hXi2
var(aX) = a2 var(X).
2.2
(2.1)
It is often the case that random variables are well-described by the normal or Gaussian
probability density function
1
1
exp( 2 (x )2 ).
2
2
p(x) =
In that case
hXi = ,
var(X) = 2 .
.
hGN (x)i
15
where x denotes the pixel coordinates, i.e. x = (i, j), is estimated by averaging over all of
the pixels in the image,
c,r
1 X
hG(x)i
g(i, j),
cr i,j=1
referred to as the sample mean vector. It is usually assumed to be independent of x, i.e.
hG(x)i = hGi.
The covariance between bands k and ` is defined according to
cov(Gk , G` ) = h(Gk hGk i)(G` hG` i)i
and is estimated again by averaging over the pixels:
cov(Gk , G` )
c,r
1 X
(gk (i, j) hGk i)(g` (i, j) hG` i),
cr i,j=1
which is called the sample covariance. The covariance is also usually assumed to be independent of x. The variance for bands k is given by
var(Gk ) = cov(Gk , Gk ) = (Gk hGk i)2 .
The random vector G is often assumed to be described by a multivariate normal probability density function p(g), given by
1
1
> 1
p
exp (g ) (g ) .
p(g) =
2
(2)N/2 ||
We indicate this by writing
G N (, ).
The distribution function of the multi-spectral pixels is then completely determined by the
expected value hGi = and by the covariance matrix . In two dimensions, for example,
2
cov(G1 , G2 )
1 12
var(G1 )
=
=
.
cov(G2 , G1 )
var(G2 )
21 22
Note that, since cov(Gk , G` ) = cov(G` , Gk ), the covariance matrix is symmetric, = > .
The covariance matrix can also be written as an outer product:
= h(G hGi)(G hGi)> i.
as can its estimated value:
c,r
1 X
(g(i, j) hGi)(g(i, j) hGi)> .
cr i,j=1
= hGG> i.
Another useful identity applies to any linear combination a> G of the random vector G,
namely
var(a> G) = a> a.
(2.2)
16
cov(G1 ,G2 )
1
12
1 12
1
var(G1 )var(G2 )
1 2
C=
=
.
=
21
21 1
1
cov(G2 ,G1 )
1
1 2
var(G1 )var(G2 )
The following ENVI/IDL program calculates and prints out the covariance matrix of a
multispectral image:
envi_select, title=Choose multispectral image,fid=fid,dims=dims,pos=pos
if (fid eq -1) then return
num_cols = dims[2]-dims[1]+1
num_rows = dims[4]-dims[3]+1
num_pixels = (num_cols*num_rows)
num_bands = n_elements(pos)
samples=intarr(num_bands,n_elements(num_pixels))
for i=0,num_bands-1 do samples[i,*]=envi_get_data(fid=fid,dims=dims,pos=pos[i])
print, correlate(samples,/covariance,/double)
end
ENVI> .GO
111.46663
82.123236
159.58377
133.80637
82.123236
64.532431
124.84815
104.45298
205.63420
159.58377
124.84815
246.18004
133.80637
104.45298
205.63420
192.70367
2.3
A special function
1! = 0! = 1.
THEOREM
17
P (a, ) = 1.
2.4
If A and B are two events such that the probability of A andB occurring simultaneously is
P (A, B), then the conditional probability of A occuring given that B has occurred is
P (A | B) =
P (A, B)
.
P (B)
18
Bayes Theorem (named after Rev. Thomas Bayes, an 18th century mathematician who
derived a special case) is the basic starting point for inference problems using probability
theory as logic. We will use it in the following form. Let X be a random variable describing
a pixel intensity, and let {Ck | k = 1 . . . M } be a set of possible classes for the pixels. Then
the a posteriori conditional probability for class Ck , given the measured pixel intensity x is
P (Ck |x) =
P (x|Ck )P (Ck )
,
P (x)
(2.3)
where
P (Ck ) is the prior-probability for class Ck ,
P (x|Ck ) is the conditional probability of observing the value x, if it belongs to class Ck ,
PM
P (x) = k=1 p(x|Ck )p(Ck ) is the total probability for x.
2.5
Linear regression
Applying radiometric corrections to digital images often involves fitting a set of m data
points (xi , yi ) to a straight line:
y(x) = a + bx + .
Suppose that the measurements yi include a random error with variance 2 and that the
measurements xi are exact. Define a goodness of fit function
2
m
X
yi a bxi
2
(a, b) =
.
(2.4)
i=1
If the random variable is normally distributed, then we obtain the most likely (i.e. best)
values for a and b by minimizing this function, that is, by solving the equations
2
2
=
= 0.
a
b
The solution is
b = sxy ,
s2xx
where
a
= y b
x,
sxy =
1 X
(xi x
)(yi y)
m i=1
s2xx =
1 X
(xi x
)2
m i=1
(2.5)
1 X
xi ,
m i=1
m
x
=
1 X
yi .
m i=1
m
y =
(2.6)
19
2 =
20
Exercises
1. Write the multivariate normal probability density function p(g) for the case = 2 I.
Show that probability density function for a one-dimensional random variable G is a
special case. Prove that hGi = .
2. In the Monty Hall game a contestant is asked to choose between one of three doors.
Behind one of the doors is an automobile as prize for choosing the correct door. After
the contestant has chosen, Monty Hall opens one of the other two doors to show that
the automobile is not there. He then asks the contestant if she wishes to change her
mind and choose the other unopened door. Use Bayes theorem to prove that her
correct answer is yes.
3. Derive the uncertainty for a in (2.6) from the formula for error propagation
a2
N
X
i=1
f
yi
2
.
Chapter 3
Transformations
Up until now we have thought of multispectral images as (r c N )-dimensional arrays
of measured pixel intensities. In the present chapter we consider other representations of
images which are often useful in image analysis.
3.1
Fourier transforms
Figure 3.1: Fourier series approximation of a sawtooth function. The series was truncated
at k = 4. The left hand side shows the intensities |
x(k)|2 .
A periodic function x(t) with period T ,
x(t) = x(t + T )
can always be expressed as the infinite Fourier series
x(t) =
x
(k)ei2(kf )t ,
(3.1)
k=
where f = 1/T = /2 and eix = cos x + i sin x. From the orthogonality of the e-functions,
the coefficients x
(k) in the expansion are given by
Z 1/2f
x
(k) = f
x(t)ei2(kf )t dt.
(3.2)
1/2f
21
22
CHAPTER 3. TRANSFORMATIONS
Figure 3.1 shows an example for the sawtooth function with period T = 1:
x(t) = t, 1/2 t < 1/2.
Parsevals formula follows directly from (3.2)
Z
X
|
x(k)|2 = f
k
3.1.1
1/2f
(x(t))2 dt.
1/2f
Let g(j) be a discrete sample of the real function g(x) (a row of pixels), sampled c times at
the sampling interval over a complete period T , i.e.
g(j) = g(x = j),
j = 0 . . . c 1.
c/2
1 X
g(k)ei2(kf )(j) , j = 0 . . . c 1,
c
(3.3)
k=c/2
where the truncation frequency 2c f is the highest frequency component that can be determined by the sampling. This frequency is called the Nyquist critical frequency and is given
by 1/2, so that f is determined by
cf
1
=
2
2
or
f=
1
.
c
(This corresponds to sampling over one complete period: c = T .) Thus (3.3) becomes
c/2
1 X
g(k)ei2kj/c ,
g(j) =
c
j = 0 . . . c 1.
k=c/2
c/21
1 X
g(k)ei2kj/c ,
c
j = 0 . . . c 1,
k=c/2
c/21
1
1 X
1 X
g(k)e2kj/c +
g(k)ei2kj/c
c
c
k=0
k=c/2
c/21
c1
0
1 X
1 X
g(k)ei2kj/c +
X(k 0 c)ei2(k c)j/c
c
c 0
c1
1 X
1 X
g(k)ei2kj/c +
g(k c)ei2kj/c .
c
c
k=0
k =c/2
c/21
k=0
k=c/2
3.2. WAVELETS
23
1X
g(k)ei2kj/c ,
c
c1
g(j) =
j = 0 . . . c 1,
(3.4)
k=0
c1
X
g(j)ei2kj/c ,
k = 0 . . . c 1.
(3.5)
j=0
(3.6)
j=0
Eq. (3.4) itself is the discrete inverse Fourier transform. The discrete analog of Parsivals
formula is
c1
c1
X
1X
|
g (k)|2 =
g(j)2 .
(3.7)
c j=0
k=0
Determining the frequency components in (3.5) would appear to involve, in all, c2 floating
point multiplication operations. The fast Fourier transform (FFT) exploits the structure of
the complex e-functions to reduce this to order c log c, see for example [PFTV86].
3.1.2
The discrete Fourier transform is easily generalized to two dimensions for the purpose of
image analysis. Let g(i, j), i, j = 0 . . . c 1, represent a (quadratic) gray scale image. Its
discrete Fourier transform is
g(k, `) =
c1 X
c1
X
g(i, j)ei2(ik+j`)/c
(3.8)
i=0 j=0
c1 c1
1 XX
g(k, `)ei2(ik+j`)/c .
c2
(3.9)
k=0 `=0
3.2
Wavelets
Unlike the Fourier transform, which represents a signal (array of pixel intensities) in terms
of pure frequency functions, the wavelet transform expresses the signal in terms of functions
which are restricted both in terms of frequency and spatial extent. In many applications,
this turns out to be particularly efficient and useful. Well see an example of this in Chapter
7, where we discuss image fusion in more detail. The wavelet transform is discussed in
Appendix B.
24
3.3
CHAPTER 3. TRANSFORMATIONS
Principal components
AA> = I,
and let the the transformed principal component vector be Y = A> G with covariance matrix
0 . Then we have
0 = hYY> i = hA> GG> Ai
1
0
= A> A = Diag(1 . . . N ) =
...
0
2
..
.
..
.
0
0
=: .
..
.
N
The fraction of the total variance in the original multispectral image which is described by
the first i principal components is
1 + . . . + i
.
1 + . . . + i + . . . + N
If the original multispectral channels are highly correlated, as is usually the case, the first
few principal components will account for a very high percentage of the variance the image.
For example, a color composite of the first 3 principal components of a LANDSAT TM
scene displays essentially all of the information contained in the 6 spectral components in
one single image. Nevertheless, because of the approximation involved in the assumption
of a normal distribution, higher order principal components may also contain significant
information [JRR99].
The principal components transformation can be performed directly from the ENVI main
menu. However the following IDL program illustrates the procedure in detail:
; Principal components analysis
envi_select, title=Choose multispectral image, $
25
fid=fid, dims=dims,pos=pos
if (fid eq -1) then return
num_cols = dims[2]+1
num_lines = dims[4]+1
num_pixels = (num_cols*num_lines)
num_channels = n_elements(pos)
image=intarr(num_channels,num_pixels)
for i=0,num_channels-1 do begin
temp=envi_get_data(fid=fid,dims=dims,pos=pos[i])
m = mean(temp)
image[i,*]=temp-m
endfor
; calculate the transformation matrix A
sigma = correlate(image,/covariance,/double)
lambda = eigenql(sigma,eigenvectors=A,/double)
print,Covariance matrix
print, sigma
print,Eigenvalues
print, lambda
print,Eigenvectors
print, A
; transform the image
image = image##transpose(A)
; reform to BSQ format
PC_array = bytarr(num_cols,num_lines,num_channels)
for i = 0,num_channels-1 do PC_array[*,*,i] = $
reform(image[i,*],num_cols,num_lines,/overwrite)
; output the result to memory
envi_enter_data, PC_array
end
3.4
Principal components analysis maximizes variance. This doesnt always lead to images of
decreasing image quality (i.e. of increasing noise). The MNF transformation minimizes the
noise content rather than maximizing variance, so, if this is the desired criterion, it is to be
preferred over PCA.
Suppose we can represent a gray scale image G with covariance matrix and zero mean
as a sum of uncorrelated signal and noise noise components
G = S + N,
26
CHAPTER 3. TRANSFORMATIONS
both normally distributed, with covariance matrices S and N and zero mean. Then we
have
= hGG> i = h(S + N)(S + N)> i = hSS> i + hNN> i,
since noise and signal are uncorrelated, i.e. hSN> i = hNS> i = 0. Thus
= S + N .
(3.10)
Now let us seek a linear combination a> G for which the signal to noise ratio
SNR =
var(a> S)
a > S a
= >
>
var(a N)
a N a
a> a
1.
a> N a
(3.11)
Differentiating we get
1
1
a> a 1
SNR = >
a >
N a = 0,
a
a N a 2
(a N a)2 2
or, equivalently,
(3.12)
Both N and are symmetric and the latter is also positive definite. Its Cholesky factorization is
= LL> ,
where L is a lower triangular matrix, and can be thought of as the square root of . Such
an L always exists is is positive definite. With this, we can write (3.12) as
N a = LL> a
or, equivalently,
a>
i ai
>
ai (i ai )
1=
1
1.
i
Thus the eigenvector ai corresponding to the smallest eigenvalue i will maximize the signal
to noise ratio. Note that (3.12) can be written in the form
N A = A,
(3.13)
27
X = N
where
1/2
G,
1/2
N N
= I.
X = N
1/2
(3.14)
B> X B = X ,
B> B = I.
(3.15)
Y = B> N
G = A> G
1/2
N A = N N
=
=
=
1/2
N X B1
X
1/2 1/2
1/2
N N N B1
X
A1
.
X
1
= SNRi + 1.
i
Thus an eigenvalue in the second transformation equal to one corresponds to pure noise.
Before the transformation can be performed, it is of course necessary to estimate the
noise covariance matrix N . This can be done for example by differencing with respect to
the local mean:
(N )k`
c,r
1 X
(gk (i, j) mk (i, j))(g` (i, j) m` (i, j))
cr i,j
where mk (i, j) is the local mean of pixels in some neighborhood of (i, j).
28
3.5
CHAPTER 3. TRANSFORMATIONS
Let x represent the coordinates of a pixel within image G, i.e. x = (i, j). We consider the
covariance matrix between the original image, represented by G(x), and the same image
G(x + ) shifted by an amount = (x , y )> :
() = hG(x)G(x + )> i,
assumed to be independent of x. Then
(0) = ,
and furthermore
() = hG(x)G(x )> i
= hG(x + )G(x)> i
= h(G(x)G(x + )> )> i
= ()> .
Now we consider the covariance of projections of the original and shifted images:
cov(a> G(x), a> G(x + )) = a> hG(x)G(x + )> ia
= a> ()a
= a> ()a
1
= a> (() + ())a.
2
(3.16)
Define as the covariance matrix of the difference image G(x) G(x + ), i.e.
= h(G(x) G(x + ))(G(x) G(x + )> i
= hG(x)G(x)> i + hG(x + )G(x + )> i hG(x)G(x + )> i
hG(x + )G(x)> i
= 2 () ().
Hence () + () = 2 and we can write (3.16) in the form
1
cov(a> G(x), a> G(x + )) = a> a a> a.
2
The correlation of the projections is therefore given by
a> a 12 a> a
a> a 12 a> a
= p
(a> a)(a> a)
=1
(3.17)
1 a> a
.
2 a> a
or
29
1
1
R
a> a 1
= >
a >
a = 0
a
a a 2
(a a)2 2
(a> a) a = (a> a)a.
(3.18)
which is seen to have the same form as (3.12). Again both and are symmetric and
the latter is also positive definite and we obtain the standard eigenproblem
[L1 (L1 )> ]b = b,
for the real, symmetric matrix L1 (L1 )> .
Let the eigenvalues be 1 . . . N and the corresponding (orthogonal) eigenvectors be
bi . We have
>
>
>
i 6= j,
0 = b>
i bj = ai LL aj = ai aj ,
and therefore
>
>
cov(a>
i G(x), aj G(x)) = ai aj = 0,
i 6= j,
so that the MAF components are orthogonal (uncorrelated). Moreover with equation (2.14)
we have
1
>
corr(a>
i G(x), ai G(x + )) = 1 i ,
2
and the first MAF component has minimum autocorrelation.
An ENVI plug-in for performing the MAF transformation is given in Appendix D.5.2.
30
CHAPTER 3. TRANSFORMATIONS
Exercises
1. Show that, for x(t) = sin(2t) in Eq. (2.2),
x
(1) =
1
,
2i
x
(1) =
1
,
2i
and x
(k) = 0 otherwise.
2. Calculate the discrete Fourier transform of the sequence 2, 4, 6, 8 from (3.4). You have
to solve four simultaneous equations, the first of which is
2=
1
g(0) + g(1) + g(2) + g(3) .
4
Chapter 4
Radiometric enhancement
4.1
Lookup tables
Figure 4.1: Contrast enhancement with a lookup table represented as the continuous function
f (x) [JRR99].
Intensity enhancement of an image is easily accomplished by means of lookup tables. For
byte-encoded data, the pixel intensities g are used to index an array
LU T [k],
k = 0 . . . 255,
the entries of which also lie between 0 and 255. These entries can be chosen to implement
linear stretching, saturation, histogram equalization, etc. according to
gk (i, j) = LU T [gk (i, j)],
0 i r 1, 0 j c 1.
31
32
It is also useful to think of the the lookup table as an approximately continuous function
y = f (x).
If hin (x) is the histogram of the original image and hout (y) is the histogram of the image
after transformation through the lookup table, then, since the number of pixels is constant,
hout (y) dy = hin (x) dx,
see Fig.4.1
4.1.1
Histogram equalization
and
y = f (x)
hin (t)dt.
0
The lookup table y for histogram equalization is thus proportional to the cumulative sum
of the original histogram.
4.1.2
Histogram matching
4.2. CONVOLUTIONS
33
are combined in a mosaic. We can do this by first equalizing both the input histogram
hin (x) and the reference histogram href (y) with the cumulative lookup tables z = f (x) and
z = g(y), respectively. The required lookup table is then
y = g 1 (z) = g 1 (f (x)).
The necessary steps for implementing this function are illustrated in Fig. 1.5 taken from
[JRR99].
4.2
Convolutions
c1
X
g(j)eij .
(4.1)
j=0
(4.2)
where the sum is over all nonzero elements of the filter h. If the number of nonzero elements
is finite, we speak of a finite impulse response filter (FIR).
Theorem 1 (Convolution theorem) In the frequency domain, convolution is replaced by
f (j)eij =
h()
g () =
j,k
X
k
ik
h(k)e
h(k)g(j k)eij
!
i`
g(`)e
h(k)g(`)ei(k+`)
k,`
k,j
This can of course be generalized to two dimensional images, so that there are three
basic steps involved in image filtering:
1. The image and the convolution filter are transformed from the spatial domain to the
frequency domain using the FFT.
2. The transformed image is multiplied with the frequency filter.
3. The filtered image is transformed back to the spatial domain.
34
We often distinguish between low-pass and high-pass filters. Low pass filters perform
some sort of averaging. The simplest example is
h = (1/2, 1/2, 0 . . .),
which computes the average of two consecutive pixels. A high-pass filter computes differences
of nearby pixels, e.g.
h = (1/2, 1/2, 0 . . .).
Figure 4.3 shows the Fourier transforms of these two simple filters generated by the the IDL
program
; Hi-Lo pass filters
x = fltarr(64)
x[0]=0.5
x[1]=-0.5
p1 =abs(FFT(x))
x[1]=0.5
p2 =abs(FFT(x))
envi_plot_data,lindgen(64),[[p1],[p2]]
end
Figure 4.3: Low-pass(red) and high-pass (white) filters in the frequency domain. The quan2
tity |h(k)|
is plotted as a function of k. The highest frequency is at the center of the plots,
k = c/2 = 32 .
4.2.1
We shall illustrate image filtering with the so-called Laplacian of Gaussian (LoG) filter,
which will be used in Chapter 6 to implement contour matching for automatic determination
of ground control points. To begin with, consider the gradient operator for a two-dimensional
image:
=
=i
+j
,
x
x1
x2
4.2. CONVOLUTIONS
35
where i and j are unit vectors in the vertical and horizontal directions, respectively. g(x)
is a vector in the direction of the maximum rate of change of gray scale intensity. Since the
intensity values are discrete, the partial derivatives must be approximated. For example we
can use the Sobel operators:
g(x)
[g(i 1, j 1) + 2g(i, j 1) + g(i + 1, j 1)]
x1
[g(i 1, j + 1) + 2g(i, j + 1) + g(i + 1, j + 1)] = 2 (i, j)
g(x)
[g(i 1, j 1) + 2g(i 1, j) + g(i 1, j + 1)]
x2
[g(i + 1, j 1) + 2g(i + 1, j) + g(i + 1, j + 1)] = 1 (i, j)
which are equivalent to the two-dimensional FIR filters
1
h1 = 2
1
0
0
0
1
2
1
1
and h2 = 0
1
2
0
2
1
0 ,
1
36
Now consider the second derivatives of the image intensities, which can be represented
formally by the Laplacian
2
2
+ 2.
2 = > =
2
x1
x2
2 g(x) is a scalar quantity which is zero whenever the gradient is maximum. Therefore
changes in intensity from dark to light or vice versa correspond to sign changes in the
Laplacian and these can also be used for edge detection. The Laplacian can also be approximated by a FIR filter, however such filters tend to be very sensitive to image noise.
Usually a low-pass Gauss filter is first used to smooth the image before the Laplacian filter
is applied. It is more efficient, however, to calculate the Laplacian of the Gauss function
itself and then use the resulting function to derive a high-pass filter. The Gauss function in
two dimensions is given by
1
1
exp 2 (x21 + x22 ),
2
2
2
where the parameter determines its extent. Its Laplacian is
1
1
2
2
2
2
2
(x
+
x
2
)
exp
(x
+
x
)
2
2
2 6 1
2 2 1
a plot of which is shown in Fig. 4.4.
The following program illustrates the application of the filter to a gray scale image, see
Fig. 4.5:
pro LoG
sigma = 2.0
filter = fltarr(17,17)
for i=0L,16 do for j=0L,16 do $
filter[i,j] = (1/(2*!pi*sigma^6))*((i-8)^2+(j-8)^2-2*sigma^2) $
*exp(-((i-8)^2+(j-8)^2)/(2*sigma^2))
; output as EPS file
thisDevice =!D.Name
set_plot, PS
Device, Filename=c:\temp\LoG.eps,xsize=4,ysize=4,/inches,/Encapsulated
shade_surf,filter
device,/close_file
set_plot, thisDevice
; read a jpg image
filename = Dialog_Pickfile(Filter=*.jpg,/Read)
OK = Query_JPEG(filename,fileinfo)
if not OK then return
xsize = fileinfo.dimensions[0]
ysize = fileinfo.dimensions[1]
window,11,xsize=xsize,ysize=ysize
Read_JPEG,filename,image1
image = bytarr(xsize,ysize)
4.2. CONVOLUTIONS
37
image[*,*] = image1[0,*,*]
tvscl,image
; run the filter
filt = image*0.0
filt[0:16,0:16]=filter[*,*]
image1= float(fft(fft(image)*fft(filt),1))
; get zero-crossings and display
image2 = bytarr(xsize,ysize)
indices = where( (image1*shift(image1,1,0) lt 0) or (image1*shift(image1,0,1) lt 0) )
image2[indices]=255
wset, 11
tv, image2
end
38
Chapter 5
Topographic modelling
Satellite images are two-dimensional representations of the three-dimensional earth surface.
The correct treatment of the third dimension the elevation is essential for terrain modelling and accurate georeferencing.
5.1
RST transformation
(5.1)
X = X + X0
Y = Y + Y0
Z = Z + Z0
1
0
T=
0
0
a uniform scaling by 50% to
1/2
0
S=
0
0
1 The
0 0
1 0
0 1
0 0
0
1/2
0
0
X0
Y0
,
Z0
1
0
0
1/2
0
0
0
,
0
1
39
40
cos
sin
R =
0
0
sin
cos
0
0
0 0
0 0
,
1 0
0 1
(5.2)
5.2
Imaging transformations
y
Y = ( Z).
X=
(5.4)
41
Thus, in order to extract the geographical coordinates (X, Y ) of a point on the earths
surface from its image coordinates, we require knowledge of the elevation Z. Correcting for
the elevation in this way constitutes the process of orthorectification.
5.3
Equation (5.3) is overly simplified, as it assumes that the origin of world and image coordinates coincide. In order to apply it, one has first to transform the image coordinate system
from the satellite to the world coordinate system. This is done in a straightforward way
with the rotation and translation transformations introduced in Section 5.1. However it
requires accurate knowledge of the height and orientation of the satellite imaging system at
the time of the image acquisition (or, more exactly, during the acquisition, since the latter
is normally not instantaneous). The resulting non-linear equations that relate image and
world coordinates are what constitute the camera or sensor model for that particular image.
Direct use of the camera model for image processing is complicated as it requires extremely exact, sometimes proprietary information about the sensor system and its orbit.
An alternative exists if the image provider also supplies a so-called rational function model
(RFM) which approximates the camera model for each acquisition as a ratio of rational
polynomials, see e.g. [TH01]. Such RFMs have the form
a(X 0 , Y 0 , Z 0 )
b(X 0 , Y 0 , Z 0 )
c(X 0 , Y 0 , Z 0 )
c0 = g(X 0 , Y 0 , Z 0 ) =
d(X 0 , Y 0 , Z 0 )
r0 = f (X 0 , Y 0 , Z 0 ) =
(5.5)
where c0 and r0 are the column and row (XY) coordinates in the image plane relative to an
origin (c0 , r0 ) and scaled by a factor cs resp. rs :
c0 =
c c0
,
cs
r0 =
r r0
.
rs
X X0
,
Xs
Y0 =
Y Y0
,
Ys
Z0 =
Z Z0
.
Zs
The polynomials a, b, c and d are typically to third order in the world coordinates, e.g.
a(X, Y, Z) = a0 + a1 X + a2 Y + a3 Z + a4 XY + a5 XZ + a6 Y Z + a7 X 2 + a8 Y 2 + a9 Z 2
+ a10 XY Z + a11 X 3 + a12 XY 2 + a13 XZ 2 + a14 X 2 Y + a15 Y 3 + a16 Y Z 2
+ a17 X 2 Z + a18 Y 2 Z + a19 Z 3
The advantage of using ratios of polynomials is that these are less subject to interpolation
error.
For a given acquisition the provider fits the RFM to his camera model using a threedimensional grid of points covering the image and world spaces with a least squares fitting
procedure. The RFM is capable of representing the camera model extremely well and can
be used as a replacement for it. Both Space Imaging and Digital Globe provide RFMs with
their high resolution IKONOS and QuickBird imagery. Below is a sample Quickbird RFM
file giving the origins, scaling factors and polynomial coefficients needed in Eq. (5.5).
42
satId = "QB02";
bandId = "P";
SpecId = "RPC00B";
BEGIN_GROUP = IMAGE
errBias =
56.01;
errRand =
0.12;
lineOffset = 4683;
sampOffset = 4154;
latOffset =
32.5709;
51.8391;
longOffset =
heightOffset = 1582;
lineScale = 4733;
sampScale = 4399;
latScale =
0.0256;
longScale =
0.0269;
heightScale = 500;
lineNumCoef = (
+1.162844E-03,
-7.011681E-03,
-9.993482E-01,
-1.119999E-02,
-6.682911E-06,
+7.591306E-05,
+3.632740E-04,
-1.111298E-04,
-5.842086E-04,
+2.212466E-06,
-1.275349E-06,
+1.279061E-06,
+1.918762E-08,
-6.957548E-07,
-1.240783E-06,
-7.644403E-07,
+3.479752E-07,
+1.259300E-05,
+1.085128E-06,
-1.571375E-06);
lineDenCoef = (
+1.000000E+00,
+1.801541E-06,
+5.822024E-04,
+3.774278E-04,
-2.141015E-08,
-6.984359E-07,
-1.344888E-06,
-9.669251E-07,
-4.726988E-08,
+1.329814E-06,
+2.113403E-08,
-2.914653E-06,
43
44
END_GROUP = IMAGE
END;
To illustrate a simple use of the RFM data, consider a vertical structure in a highresolution image, such as a chimney or building fassade. Suppose we determine the image
coordinates of the bottom and top of the structure to be (rb , cb ) and (rt , ct ), respectively.
Then from 5.5
rb = f (X, Y, Zb )
cb = g(X, Y, Zb )
rt = f (X, Y, Zt )
(5.6)
ct = g(X, Y, Zt ),
since the (X, Y ) coordinates must be the same. This would appear to constitute a set of
four equations in four unknowns X, Y , Zb and Zt , however the solution is unstable because
of the close similarity of Zt to Zb . Nevertheless the object height Zt Zb can be obtained
by the following procedure:
1. Get (rb , cb ) and (rt , ct ) from the image.
2. Solve first two equations in (5.6) (e.g. with Newtons method) for X and Y with Zb
set equal to the average elevation in the scene if no DEM is available, otherwise to the
true elevation.
3. For a spanning range of Zt0 values, calculate (rt0 , c0t ) from the second two equations in
(5.6) and choose for Zt the value of Zt0 which gives closest agreement to the values
read in.
Quite generally, the RFM can approximate the camera model very well and can be used
as an alternative for providing end users with the necessary information to perform their
own photogrammetric processing. An ENVI plug-in for object height determination
from RFM data is given in Appendix D.2.1.
5.4
The missing elevation information Z in (5.3) or in (5.5) can be obtained with stereoscopic
imaging techniques. Figure 5.2 shows two cameras viewing the same world point w from
two positions. The separation of the lens centers is the baseline. The objective is to find
the coordinates (X, Y, Z) of w if its image points have coordinates (x1 , y1 ) and (x2 , y2 ). We
assume that the cameras are identical and that their image coordinate systems are perfectly
aligned, differing only in the location of their origins. The Z coordinate of w is the same for
both coordinate systems.
In Figure 5.3 the first camera is brought into coincidence with the world coordinate
system. Then from (5.4),
x1
X1 =
( Z).
Alternatively, if the second camera is brought to the origin of the world coordinate system,
x2
X2 =
( Z).
46
B
.
x2 x1
(5.7)
Thus if the displacement of the image coordinates of the point w, namely x2 x1 can be
determined, the Z coordinate can be calculated. The task is then to find two corresponding points in different images of the same scene. This is usually accomplished by spatial
correlation techniques and is closely related to the problem of image-to-image registration
discussed in the next chapter.
48
pro test_correl_images
height = 705.0
base = 370.0
pixel_size = 15.0
envi_select, title=Choose 1st image, fid=fid1, dims=dims1, pos=pos1, /band_only
envi_select, title=Choose 2nd image, fid=fid2, dims=dims2, pos=pos2, /band_only
im1 = envi_get_data(fid=fid1,dims=dims1,pos=pos1)
im2 = envi_get_data(fid=fid2,dims=dims2,pos=pos2)
n_cols = dims1[2]-dims1[1]+1
n_rows = dims1[4]-dims1[3]+1
parallax = fltarr(n_cols,n_rows)
progressbar = Obj_New(progressbar, Color=blue, Text=0,$
title=Cross correlation, column ...,xsize=250,ysize=20)
progressbar->start
for i=7L,n_cols-8 do begin
if progressbar->CheckCancel() then begin
envi_enter_data,pixel_size*parallax*(height/base)
progressbar->Destroy
return
endif
progressbar->Update,(i*100)/n_cols,text=strtrim(i,2)
for j=25L,n_rows-26 do begin
cim = correl_images(im1[i-5:i+5,j-5:j+5],im2[i-7:i+7,j-25:j+25], $
xoffset_b=0,yoffset_b=-20,xshift=0,yshift=20)
corrmat_analyze,cim,xoff,yoff,m,e,p
parallax[i,j] = yoff > (-5.0)
endfor
endfor
progressbar->destroy
envi_enter_data,pixel_size*parallax*(height/base)
end
This program makes use of the routines correl images and corrmat analyze from the IDL
Astronomy Users Library2 to calculate the cross-correlation of the two images. For each
pixel in the nadir image an 11 11 window is moved along an 11 51 window in the backlooking image centered at the same position. The point of maximum correlation defines the
parallax or displacement p. This is related to the relative elevation e of the pixel according
to
h
e = p 15m,
b
where h is the height of the sensor and b is the baseline, see Figure 5.7.
Figure 5.8 shows the result. Clearly there are many problems due to the correlation
errors, however the relative elevations are approximately correct when compared to the
DEM determined with the ENVI commercial add-on AsterDTM, see Figure 5.9.
2 www.astro.washington.edu/deutsch/idl/htmlhelp/index.html
back camera
nadir camera
satellite motion
e
p
ground
Figure 5.7: Relating parallax p to elevation e by similar triangles: e/p = (h e)/b h/b.
50
5.5
Terrain analysis involves the processing of elevation data. Specifically we consider here
the generation of slope images, which give the steepness of the terrain at each pixel, and
aspect images, which give the prevailing direction relative to north of a vector normal to the
landscape at each pixel.
A 33 pixel window can be used to determine both slope and aspect, see Figure 5.10.
Define
x1 = c a y1 = a g
x2 = f d
y2 = b h
x3 = i g
y3 = c i
and
x = (x1 + x2 + x3 )/(3xs )
y = (y1 + y2 + y3 )/(3xs ,
where xs , ys give the pixel dimensions in meters. Then the slope in % at the central pixel
position is given by
p
(x)2 + (y )2
s=
100
2
whereas the aspect in radians measured clockwise from north is
x
= tan1
.
y
51
Slope/aspect determinations from a DEM are available in the ENVI main menu under
Topographic/Topographic Modelling.
5.6
Illumination correction
Figure 5.11: Angles involved in computation of local solar elevation, taken from [RCSA03].
Topographic modelling can be used to correct images for the effects of local solar illumination, which depends not only upon the suns position (elevation and azimuth) but also
upon the local slope and aspect of the terrain being illuminated. Figure 5.11 shows the
angles involved [RCSA03]. Solar elevation is i , solar azimuth is a , p is the slope and 0
is the aspect. The quantity to be calculated is the local solar elevation i which determines
52
(5.8)
Figure 5.12: Cosine of local solar illumination angle stretched across a DEM.
Let T represent the reflectance of the inclined surface in Figure 5.11. Then for a
Lambertian surface, i.e. a surface which scatters reflected radiation uniformly in al directions,
the reflectance of the corresponding horizontal surface H would be
H = T
cos i
.
cos i
(5.9)
The Lambertian assumption is in general not correct, the actual reflectance being described by a complicated bidirectional reflectance distribution function (BRDF). An empirical appraoch which gives a better approximation to the BRDF is the C-correction [TGG82].
Let m and b be the slope and intercept of a regression line for reflectance vs. cos i for a
particular image band. Then instead of (5.9) one uses
cosi + b/m
H = T
.
(5.10)
cos i + b/m
An ENVI plug-in for illumination correction with the C-correction approximation is given in Appendix D.2.2.
Chapter 6
Image Registration
Image registration, either to another image or to a map, is a fundamental task in image
processing. It is required for georeferencing, stereo imaging, accurate change detection, or
any kind of multitemporal image analysis.
Image-to-image registration methods can be divided into roughly four classes [RC96]:
1. algorithms that use pixel values directly, i.e. correlation methods
2. frequency- or wavelet-domain methods that use e.g. the fast fourier transform(FFT)
3. feature-based methods that use low-level features such as edges and corners
4. algorithms that use high level features and the relations between them, e.g. objectoriented methods
We consider examples of frequency-domain and feature-based methods here.
6.1
Consider two N N gray scale images g1 (i0 , j 0 ) and g2 (i, j), where g2 is offset relative to g1
by an integer number of pixels:
g2 (i, j) = g1 (i0 , j 0 ) = g1 (i i0 , j j0 ),
i0 , j0 N.
(6.1)
54
55
6.2
Feature matching
A tedious task associated with image-image registration using low level image features is
the setting of ground control points (GCPs) since, in general, it is necessary to resort to
the manual entry. However various techniques for automatic determination of GCPs have
been suggested in the literature. We will discuss one such method, namely contour matching
[LMM95]. This technique has been found to function reliably in bitemporal scenes in which
vegetation changes do not dominate. It can of course be augmented (or replaced) by other
automatic methods or by manual determination. The procedures involved in image-image
registration using contour matching are shown in Fig. 6.2 [LMM95].
Image 1
LoG
Zero Crossing
Image 2
Edge Strength
Contour
Finder
Chain Code
Encoder
?
Image 2
(registered)
Warping
Consistency
Check
Closed Contour
Matching
56
6.2.1
Contour detection
The first step involves the application of a Laplacian of Gaussian filter to both images. After
determining the contours by examining zero-crossings of the LoG-filtered image, the contour
strengths are encoded in the pixel intensities. Strengths are taken to be proportional to the
magnitude of the gradient at the zero-crossing.
6.2.2
Closed contours
In the next step, all closed contours with strengths above some given threshold are determined by tracing the contours. Pixels which have been visited during tracing are set to zero
so that they will not be visited again.
6.2.3
Chain codes
For subsequent matching purposes, all significant closed contours found in the preceding
step are chain encoded. Any digital curve can be represented by an integer sequence
{a1 , a2 . . . ai . . .}, ai {0, 1, 2, 3, 4, 5, 6, 7}, depending on the relative position of the current
pixel with respect to the previous pixel in the curve. This simple code has the drawback
that some contours produce wrap around. For example the line in the direction 22.5o has
the chain code {707070 . . .}. Li et al. [LMM95] suggest the smoothing operation:
{a1 a2 . . . an } {b1 b2 . . . bn },
where b1 = a1 and bi = qi , qi is an integer satisfying (qi ai ) mod 8 = 0 and |qi bi1 | min,
i = 2, 3 . . . n.
They also suggest the applying the Gaussian smoothing filter {0.1, 0.2, 0.4, 0.2, 0.1} to the
result. Two chain codes can be compared by sliding one over the other and determining
the maximum correlation between them.
6.2.4
Invariant moments
The closed contours are first matched according to their invariant moments. These are
defined as follows, see [Hab95, GW02]. Let the set C denote the set of pixels defining a
contour, with |C| = n, that is, n is the number of pixels on the contour. The moment of
order p, q of the contour is defined as
X
mpq =
j p iq .
(6.2)
i,jC
m10
,
m00
yc =
m01
.
m00
X
i,jC
(j xc )p (i yc )q ,
(6.3)
57
1
(p+q)/2+1
00
pq .
(6.4)
1
1 X
=
(j yc )2 .
20
200
n2
i,jC
The normalized centralized moments are, apart from effects of digital quantization, invariant
under scale changes and translations of the contours.
Finally, we can define moments which are also invariant under rotations, see [Hu62]. The
first two such invariant moments are
h1 = 20 + 02
2
h2 = (20 02 )2 + 411
.
(6.5)
For example, consider a general rotation of the coordinate axes with origin at the center of
gravity of a contour:
0
j
cos
sin
j
j
=
=
A
.
i0
sin cos
i
i
The first invariant moment in the rotated coordinate system is
0
1 X 02
1 X 0 0
j
02
(j + i ) = 2
(j , i ) 0
h1 = 2
i
n 0 0
n 0 0
i ,j C
i ,j C
1 X
j
(j, i)A> A
= 2
i
n
i,jC
1 X 2
= 2
(j + i2 ),
n
i,jC
since A> A = I.
6.2.5
Contour matching
Each significant contour in one image is first matched with contours in the second image
according to their invariant moments h1 , h2 . This is done by setting a threshold on the
allowed differences, for instance 1 standard deviation. If one or more matches is found, the
best candidate for a GCP pair is then chosen to be that matched contour in the second
image for which the chain code correlation with the contour in the first image is maximum.
If the maximum correlation is less that some threshold, e.g. 0.9, then no match is found.
The actual GCP coordinates are taken to be the centers of gravity of the matched contours.
6.2.6
Consistency check
The contour matching procedure invariably generates false GCP pairs, so a further processing step is required. In [LMM95] use is made of the fact that distances are preserved under
a rigid transformation. Let A1 A2 represent the distance between two points A1 and A2 in
58
an image. For two sets of m matched contour centers {Ai } and {Bi } in image 1 and 2, the
ratios
Ai Aj /Bi Bj , i = 1 . . . m, j = i + 1 . . . m,
are calculated. These should form a cluster, so that pairs scattered away from the cluster
center can be rejected as false matches.
An ENVI plug-in for GCP determination via contour matching is given in
Appendix D.3.
6.3
We represent with (x, y) the coordinates of a point in image 1 and the corresponding point
in image 2 with (u, v). A second order polynomial map of image 2 to image 1, for example,
is given by
u = a0 + a1 x + a2 y + a3 xy + a4 x2 + a5 y 2
v = b0 + b1 x + b2 y + b3 xy + b4 x2 + b5 y 2 .
Since there are 12 unknown coefficients, we require at least 6 GCP pairs to determine the
map (each pair generates 2 equations). If more than 6 pairs are available, the coefficients can
be found by least squares fitting. This has the advantage that an RMS error for the mapping
can be estimated. Similar considerations apply for lower or higher order polynomial maps.
Having determined the map coefficients, image 2 can be registered to image 1 by resampling. Nearest neighbor resampling simply chooses the actual pixel in image 2 that has
its center nearest the calculated coordinates (u, v) and transfers it to location (x, y). This
is the preferred technique for classification or change detection, since the registered image
consists of the original pixel brightnesses, simply rearranged in position to give a correct
image geometry. Other commonly used resampling methods are bilinear interpolation and
cubic convolution interpolation, see [JRR99] for details. These methods mix the spectral
intensities of neighboring pixels.
59
Exercises
1. We can approximate the centralized moments (6.3) of a contour by the integral
Z Z
pq =
(x xx )p (y yc )q f (x, y)dxdy,
where the integration is over the whole image and where f (x, y) = 1 if the point
(x, y) lies on the contour and f (x, y) = 0 otherwise. Use this approximation to prove
that the normalized centralized moments pq given in (3.4) are invariant under scaling
transformations of the form
0
x
0
x
=
.
y0
0
y
60
Chapter 7
Image Sharpening
The change detection and classification algorithms that we will meet in the next chapters
exploit of course not only the spatial but also the spectral information of satellite imagery.
Many common platforms (Landsat 7 TM, IKONOS, SPOT, QuickBird) offer panchromatic
images with higher ground resolution than that of the spectral channels. Application of multispectral change detection or classification methods is therefore restricted to the lower resolution. Conventional image fusion techniques, such as the well-known HSV-transformation
can be used to sharpen the spectral components, however the effect of mixing-in of the
panchromatic image is often to dilute the spectral resolution. Another disadvantage of
the HSV transformation is that one is restricted to using three of the available spectral
channels. In the following we will outline the HSV method and then consider alternative
fusion techniques.
7.1
HSV fusion
In computers with 24-bit graphics (true color), any three channels of a multispectral image
can be displayed with 8 bits for each of the additive primary colors red, green and blue. The
monitor displays this as an RGB color composite image which, depending on the choice of
image channels and their relative intensities, may or may not appear to be natural. There
are 224 16 million colors possible.
Another means of color definition is in terms of hue, saturation and value (HSV). Value
(or intensity) can be thought of as an axis equidistant from the three orthogonal primary
color axes. Hue refers to the actual color and is defined as an angle on a circle perpendicular
to the value axis. Saturation is the amount of color present and is represented by the
radius of the circle described by the hue,
A commonly used method for fusion of two images (for example a lower resolution multispectral image with a higher resolution panchromatic image) is to transform the first image
from RGB to HSV space, replace the V component with the grayscale values of the second
image after performing a radiometric normalization, and then transform back to RGB space.
The forward transformation begins by rotating the RGB coordinate axes into the diagonal
61
62
axis of the RGB color cube. The coordinates in the new reference system are given by
m1
2/ 6
m2 = 0
i1
1/ 3
1/ 6 1/6
R
1/2 1/ 2 G .
1/ 3
1/ 3
B
Then the the rectangular coordinates (m1 , m2 , i1 ) are transformed into the cylindrical HSV
coordinates:
q
7.2
63
Brovey fusion
In its simplest form this method multiplies each re-sampled multispectral pixel by the ratio
of the corresponding panchromatic pixel intensity to the sum of all of the multispectral
intensities. The corrected pixel intensities gk (i, j) in the kth fused multispectral channel are
given by
gp (i, j)
,
0
k0 gk (i, j)
gk (i, j) = gk (i, j) P
(7.1)
where gk (i, j) is the (re-sampled) pixel intensity in the kth channel and gp (i, j) is the corresponding pixel intensity in the panchromatic image. (The ENVI-environment offers Brovey
fusion in its main menu.) This technique assumes that the spectral range spanned by the
panchromatic image is essentially the same as that covered by the multispectral channels.
This is seldom the case. Moreover, to avoid bias, the intensities used should be the radiances
at the satellite sensors, implying use of the sensors calibration.
7.3
PCA fusion
Panchromatic sharpening using principal components analysis (PCA) is similar to the HSV
method. After the PCA transformation, the first principal component is replaced by the
panchromatic image, again after radiometric normalization, see Figure 7.1.
64
7.4
Wavelet fusion
Wavelets provide an efficient means of representing high and low frequency components of
multispectral images and can be used to perform image sharpening. Two examples are given
here.
7.4.1
-
- C H (i, j)
k+1
-
- C V (i, j)
k+1
-
- C D (i, j)
k+1
- G
- - H
- G
Figure 7.2: Wavelet filter bank. H is a low-pass and G a high-pass filter derived from the
coefficients of the wavelet transformation. The symbol indicates downsampling by a factor
of 2. The original image gk (i, j) can be reconstructed by inverting the filter.
bz = mzms az mzpan ,
(7.2)
where mz and z denote mean and standard deviation, respectively. These coefficients are
then used to normalize the wavelet coefficients for the panchromatic image to those of the
multispectral image:
Ciz (i, j) az Ciz (i, j) + bz ,
z = H, V, D, i = 2, 3.
(7.3)
65
The degraded panchromatic image g3 (i, j) is then replaced by the each of the four multispectral images and the normalized wavelet coefficients are used to reconstruct the original 1m
resolution. We thus obtain what would be seen if the multispectral sensors had the resolution
of the panchromatic sensor [RW00].
An ENVI plug-in for panchromatic sharpening with the DWT is given in
Appendix D.4.1.
7.4.2
` trous filtering
A
The radiometric fidelity obtained with the discrete wavelet transform is excellent, as will be
shown in the next section. However the lack of translational invariance of the DWT often
leads to spatial artifacts (blurring, shadowing, staircase effect) in the sharpened product.
This is illustrated in the following program, in which an image is transformed once with the
DWT and the low-pass quadrant shifted by one pixel relative to the high-pass quadrants
(i.e. the wavelet coefficients). After inverting the transformation, serious degradation is
apparent, see Figure 7.3.
pro translate_wavelet
; get an image band
envi_select, title=Select input file, $
fid=fid, dims=dims, pos=pos, /band_only
if fid eq -1 then return
; create a DWT object
aDWT = Obj_New(DWT,envi_get_data(fid=fid,dims=dims,pos=pos))
; compress
aDWT->compress
; shift the compressed portion supressing phase correlation match
aDWT->inject,shift(aDWT->Get_Quadrant(0),[1,1]),pc=0
; restore
aDWT->expand
; return result to ENVI
envi_enter_data, aDWT->get_image()
end
As an alternative to the DWT, the `
a trous wavelet transform (ATWT) has been proposed
for image sharpening [AABG02]. The ATWT is a multiresolution decomposition defined
formally by a low-pass filter H = {h(0), h(1), . . .} and a high-pass filter G = H, where
denotes an all-pass filter. Thus the high frequency part is just the difference between the
original image and low-pass filtered image. Not surprisingly, this transformation does not
allow perfect reconstruction if the output is downsampled. Therefore downsampling is not
performed at all. Rather, at the kth iteration of the low-pass filter, 2k1 zeroes are inserted
between the elements of H. This means that every other pixel is interpolated on the first
iteration:
H = {h(0), 0, h(1), 0, . . .},
while on the second iteration
H = {h(0), 0, 0, h(1), 0, 0, . . .}
etc. (hence the name `
a trous = with holes). The low-pass filter is usually chosen to be
symmetric (unlike the Daubechies wavelet filters for example). The prototype filter chosen
66
7.5
Quality indices
Wang and Bovik [WB02] suggest the following measure of radiometric fidelity between two
image bands f and g:
67
6
6
Pan
G
MS
MS(sharpened)
6
normalize
?
insert
Figure 7.5: Comparison of three image sharpening methods with the Wang-Bovik quality
index. Left to right: Gram-Schmidt, ATWT, DTW.
68
Q=
f g
2fg
2f g
4f g fg
2
=
f g f + g2 f2 + g2
(f2 + g2 )(f2 + g2 )
(7.4)
where f and f are mean and variance of band f and f g is the covariance of the two
bands. This first term in (7.4) is seen to be the correlation coefficient between the two
images, with values in [1, 1], the second term compares their average brightness, with
values in [0, 1] and the third term compares their contrasts, also in [0, 1]. Thus perfect
radiometric correspondence would give a value Q = 1.
Since image quality is usually not spatially invariant, it is usual to compute Q in, say,
M sliding windows and then average over all such windows:
Q=
M
1 X
Qj .
M j=1
An ENVI plug-in for determining the quality index for pansharpened images is
given in Appendix D.4.3.
Figure 7.5 shows a comparison of three image sharpening methods applied to a QuickBird
image, namely the Gram-Schmidt, ATWT and DWT transformations. The latter is by far
the best, but spatial artifacts are apparent.
Chapter 8
Change Detection
To quote Singhs review article on change detection [Sin89],
The basic premise in using remote sensing data for change detection is that
changes in land cover must result in changes in radiance values ... [which] must
be large with respect to radiance changes from other factors.
In the present chapter we will mention briefly the most commonly used digital techniques for
enhancing this change signal in bitemporal satellite images, and then focus our attention
on the so-called multivariate alteration detection algorithm of Nielsen et al. [NCS98].
8.1
Algebraic methods
In order to see changes in the two multispectral images represented by N -dimensional random vectors F and G, a simple procedure is to subtract them from each other componentby-component, examining the N differenced images characterized by
F G = (F1 G1 , F2 G2 . . . FN GN )>
(8.1)
for significant changes. Pixel intensity differences near zero indicate no change, large positive
or negative values indicate change, and decision thresholds can be set to define significant
changes. If the difference signatures in the spectral channels are used to classify the kind of
change that has taken place, one speaks of change vector analysis. Thresholds are usually
expressed in standard deviations from the mean difference value, which is taken to correspond
to no change.
Alternatively, ratios of intensities of the form
Fk
,
Gk
k = 1...N
(8.2)
can be built between successive images. Ratios near unity correspond to no-change, while
small and large values indicate change. A disadvantage of this method is that random
variables of the form (8.2) are not normally distributed, so simple threshold values defined
in terms of standard deviations are not valid.
Other algebraic combinations, such as differences in vegetation indices (Section 2.1) are
also in use. All of these band math operations can of course be performed conveniently
within the ENVI/IDL environment.
69
70
8.2
Principal components
8.3
Post-classification comparison
If two co-registered satellite images have been classified, then the class labels can be compared to determine land cover changes. If classification is carried out at the pixel level (as
opposed to segments or objects), then classification errors (typically > 5%) may dominate
the true changes, depending on the magnitude of the latter. ENVI offers functions for
statistical analysis of post-classification change detection.
8.4
71
Suppose we make a linear combination of the intensities for all N channels in the first image
acquired at time t2 , represented by the random vector F. That is, we create a single image
whose pixel intensities are
U = a > F = a 1 F 1 + a 2 F2 + . . . aN F N ,
where the vector of coefficients a is as yet unspecified. We do the same for t2 , i.e. we make
the linear combination V = b> G, and then look at the scalar difference image U V . This
procedure combines all the information into a single image, whereby one still hast to choose
the coefficients a and b in some suitable way. Nielsen et al. [NCS98] suggest determining
the coefficients so that the positive correlation between U and V is minimized. This means
that the resulting difference image U V will show maximum spread in its pixel intensities.
If we assume that the spread is primarily due to actual changes that have taken place in the
scene over the interval t2 t1 , then this procedure will enhance those changes as much as
possible.
Specifically we seek linear combinations such that
var(U V ) = var(U ) + var(V ) 2cov(U, V ) maximum,
(8.3)
(8.4)
var(U V ) = 2(1 ),
(8.5)
cov(U, V )
var(U )var(V )
Since we are dealing with change detection, we require that the random variables U and V
be positively correlated, that is,
cov(U, V ) > 0.
We thus seek vectors a and b which minimize the positive correlation .
8.4.1
var(V ) = b> gg b,
cov(U, V ) = a> f g b.
72
If we introduce the Lagrange multipliers /2 and /2, extremalizing the covariance cov(U, V )
under the constraints (8.4) is equivalent to extremalizing the unconstrained Lagrange function
L
= f g b 2f f a = 0,
a
2
or
a=
1 1
f g b,
ff
= f g a 2gg b = 0
b
2
b=
1 1
f g a.
gg
= p
var(U )var(V )
=p
a > f g b
.
a> f f a b> gg b
a> f g 1
gg gf a
,
>
a f f a
2 =
b> gf 1
f f f g b
b> gg b
(8.6)
Thus the desired projections U = a> F are given by the eigenvectors a1 . . . aN corresponding
to the generalized eigenvalues
2 1 . . . N
>
of f g 1
gg gf with respect to f f . Similarly the desired projections V = b G are given
1
by the eigenvectors b1 . . . bN of gf f f f g with respect to gg corresponding to the same
eigenvalues. Nielsen et al. [NCS98] refer to the N difference components
Mi = Ui Vi = ai > F bi > G, i = 1 . . . N,
(8.7)
8.4.2
73
8.4.3
i 6= j.
(8.8)
Furthermore
1
bi = 1
gg gf ai ,
i
i.e. substituting this into the LHS of the second equation in (8.6):
p
1
1
1 1
gf 1
i ai = i gg bi ,
f f f g gg gf ai = gf f f i f f ai = gf
i
i
as required. It follows that
p
p
> 1
f g 1
j a >
j ij ,
a>
gg gf aj =
i f g bj = ai p
i f f ai =
j
and similarly for b>
i gf aj . Thus the covariances of the MAD components are given by
p
>
>
>
cov(Ui Vi , Uj Vj ) = cov(a>
j ).
i F bi G, aj F bj G) = 2ij (1
The MAD components are therefore orthogonal (uncorrelated) with variances
p
2
i ).
var(Ui Vi ) = M
ADi = 2(1
(8.9)
The transformation corresponding to the smallest eigenvalue, namely (aN , bN ), will thus
give maximal variance for the difference U V .
We can derive change probabilities from a MAD image as follows. The sum of the squares
of the standardized MAD components for no-change pixels, given by
2
2
M ADN
M AD1
Z=
+ ... +
,
M AD1
M ADN
is approximately chi-square distributed with N degrees of freedom, i.e.,
P r(Z z) = P (N/2, z/2).
For a given measured value z for some pixel, the probability that Z could be that large or
larger, given that the pixel is no-change, is
1 P (N/2, z/2).
74
The probability that the pixel is a change pixel is therefore the complement of this,
Pchange (z) = 1 (1 P (N/2, z/2)) = P (N/2, z/2).
(8.10)
This quantity can be plotted for example as a gray scale image to show the regions of change.
The last MAD component has maximum spread in its pixel intensities and, ideally,
maximum change information. However, depending on the type of change one is looking for,
the other components may also be extremely useful. The second-to-last image has maximum
spread subject to the condition that the pixel intensities are statistically uncorrelated with
those in the first image, and so on. Since interesting anthropomorphic changes will generally
be uncorrelated with dominating seasonal vegetation changes or stochastic image noise, it is
quite common that such changes will be concentrated in higher order components. This in
fact is one of the nicest aspects of the method it sorts different categories of change into
different image components. Therefore we can also perform change vector analysis on the
MAD change vector.
An ENVI plug-in for MAD is given in Appendix D.5.1.
8.4.4
8.4.5
Scale invariance
An additional advantage of the MAD procedure stems from the fact that the calculations
involved are invariant under linear transformations of the original image intensities. This
implies that the method is insensitive to differences in atmospheric conditions or sensor
calibrations at the two acquisition times. We can see this as follows. Suppose the second
image G is transformed according to some linear transformation T,
H = TG.
The relevant covariance matrices for (8.6) are then
0f g = hFH> i = f g T>
0gf = hHF> i = Tgf
0f f = f f
0gg = hHH> i = Tgg T> .
The eigenproblems are therefore
f g T> (Tgg T> )1 Tgf a = 2 f f a
>
2
>
Tgf 1
f f f g T c = Tgg T c,
75
which are identical to (8.6) with b = T> c. Therefore the MAD components in the transformed situation are
>
>
>
>
>
>
>
>
a>
i F ci H = ai F ci TG = ai F (T ci ) G = ai F bi G
as before.
8.4.6
8.4.7
Decision thresholds
Since the MAD components are approximately normally distributed about zero and uncorrelated, see Figure 8.2, decision thresholds for change or no change pixels can be set in terms
of standard deviations about the mean for each component separately. This can be done
arbitrarily, for example by saying that all pixels in a MAD component whose intensities are
within 2M AD are no-change pixels.
(8.11)
76
SC ,
SU = S\SN C SC SC+ ,
SC+ ,
with SU denoting the set of ambiguous pixels.1 From the sample mean and sample variance,
we estimate initially the moments for the distribution of no-change pixels:
N C =
(N C )2 =
1
|SN C |
1
|SN C |
xi ,
iSN C
(xi N C )2
iSN C
(|S| denotes set cardinality) and similarly for C and C+. Bruzzone and Prieto [BP00]
suggest improving these estimates by using the pixels in SU and applying the so-called EM
algorithm (see [Bis95] for a good explanation):
0N C =
p(N C | xi )xi /
iS
0
2
(N
C) =
p(N C | xi )
iS
p(N C | xi )(xi 0N C )2 /
iS
p0 (N C) =
1 X
p(N C | xi ) ,
|S|
X
iS
p(N C | xi )
(8.12)
iS
where p(N C | xi ) is the a posteriori probability for a no-change pixel conditional on measurement xi . We have the following rules for determining p(N C | xi ):
1. i SN C :
p(N C | xi ) = 1
2. i SC :
p(N C | xi ) = 0
1 The symbols and \ denote set union and set difference, respectively. These sets can be determined
in practice by setting generous, scene-independent thresholds for change and no-change pixel intensities, see
[BP00].
77
which can be iterated numerically to improve the initial estimates of the distributions. One
can then determine e.g. the upper change threshold as the appropriate solution of
p(x | N C)p(N C) = p(x | C+)p(C+).
Taking logarithms,
1
N C p(C+)
1
2
2
(x
(x
)
=
log
=: A
C+
NC
2
2
2C+
2N
C+ P (N C)
C
with solutions
x=
2
2
C+ N
C N C C+ N C C+
2
2
(N C C+ )2 + 2A(N
C C+ )
2
2
N
C C+
8.5
Radiometric normalization
78
homogeneous and can be approximated by linear functions. The critical aspect is the determination of suitable time-invariant features upon which to base the normalization.
As we have seen, the MAD transformation invariant to linear and affine scaling. Thus, if
one uses MAD for change detection applications, preprocessing by linear radiometric normalization is superfluous. However radiometric normalization of imagery is important for many
other applications, such as mosaicing, tracking vegetation indices over time, supervised and
unsupervised land cover classification, etc. Furthermore, if some other, non-invariant change
detection procedure is preferred, it must generally be preceded by radiometric normalization [CNS04]. Taking advantage of this invariance, one can apply the MAD transformation
to select the no-change pixels in bitemporal images, and then used them for radiometric
normalization. The procedure is simple, fast and completely automatic and compares very
favorably with normalization using hand-selected, time-invariant features.
An ENVI plug-in for radiometric normalization with the MAD transformation is given in Appendix D.5.3.
Chapter 9
Unsupervised Classification
Supervised classification of multispectral remote sensing imagery is commonly used for landcover determination, see Chapter 10. For supervised classification it is very important to
define training areas which adequately represent the spectral characteristics of each class in
the image to be classified, as the quality of the training set has a significant effect on the
classification process and its accuracy. Finding and verifying training areas can be rather
laborious since the analyst must select representative pixels for each of the classes. This
must be done by visual examination of the image data and by information extraction from
additional sources such as ground reference data (ground truth) or existing maps.
Unlike supervised classification, clustering methods (or unsupervised methods) require
no training sets at all. Instead, they attempt to find the underlying structure automatically
by organizing the data into classes sharing similar, e.g. spectrally homogeneous, characteristics. The analyst simply needs to specify the number of clusters present. Clustering plays
an especially important role when very little a priori information about the data is available and provides a useful method for organizing a large set of data so that the retrieval
of information may be made more efficiently. A primary objective of using clustering algorithms for pre-classification of multispectral remote sensing data in particular is to obtain
optimum information for the selection of training regions for subsequent supervised land-use
segmentation of the imagery.
9.1
We begin with the assumption that the measured features (pixel intensities)
x = {xi | i = 1 . . . n}
are chosen independently from K multivariate normally distributed populations corresponding the K principal land cover categories present in the image. The xi are thus realization
of random vectors
Xk N (k , k ), k = 1 . . . K.
(9.1)
Here k and k are the expected value and covariance matrix of Xk , respectively. We
denote a given clustering by C = {C1 , . . . Ck , . . . CK } where Ck denotes the index set for
the kth cluster.1 We wish to maximize the posteriori probability p(C | x) for observing the
1 The
79
80
p(x | C)p(C)
.
p(x)
(9.2)
The quantity p(x|C) is the joint probability density function for clustering C, also referred to
as the likelihood of observing the clustering C given the data x, P (C) is the prior probability
for C and p(x) is a normalization independent of C.
The joint probability density for the data is the product of the individual probability
densities, i.e.,
p(x | C) =
K
Y
Y
p(xi | Ck )
k=1 iCk
K
Y
Y
N/2
(2)
1/2
|k |
k=1 iCk
1
> 1
exp (xi k ) k (xi k ) .
2
Forming the product in this way is justified by the independence of the samples. The
log-likelihood is given by [Fra96]
L = log p(x | C)
K
X
X N
1
1
=
)
.
i
k
k
2
2
2
k=1 iCk
(9.3)
k = 1 . . . K,
(9.4)
K
X
X (xi )> (xi )
1
k
k
(xi ) k ) ( 2 I)(xi k ) =
2
2 2
>
k=1 iCk
K
X
X (xi )> (xi )
k
k
log p(C).
2 2
(9.5)
k=1 iCk
81
uki = 1,
i = 1 . . . n,
(9.7)
k=1
meaning that each sampled pixel xi , i = 1 . . . n, belongs to precisely one class, and
n
X
uki > 0,
k = 1 . . . K,
(9.8)
i=1
meaning that no class Ck , k = 1 . . . K, is empty. The sum in (9.8) is the number nk of pixels
in the kth class. An unbiased estimate mk of the expected value k for the kth cluster is
therefore given by
Pn
uki xi
1 X
k m k =
xi = Pi=1
, k = 1 . . . K,
(9.9)
n
nk
i=1 uki
iCk
k = 1 . . . K.
(9.10)
K X
n
X
k=1 i=1
uki
(9.11)
Finally, if we do not wish to include prior probabilities, we can simply say that all clustering
configurations C are a priori equally likely. Then the last term in (refe911) is independent of
C and we have, dropping the multiplicative constant 1/2 2 , the well-known sum-of-squares
cost function
K X
n
X
uki (xi mk )> (xi mk ).
(9.12)
E(C) =
k=1 i=1
9.2
We begin with the popular K-means method and then consider an algorithm due to (Palubinskas 1998) [Pal98], which uses cost function (9.11) and for which the number of clusters
is determined automatically. Then we discuss a common version of bottom-up or agglomerative hierarchical clustering, and finally a fuzzy version of the K-means algorithm.
9.2.1
K-means
The K-means clustering algorithm (KM) (sometimes referred to as basic Isodata [DH73] or
migrating means [JRR99]) is based on the cost function (9.12). After initialization of the
cluster centers, the distance measure corresponding to a minimization of (9.12), namely
d(i, k) = (xi mk )> (xi mk )
is used to re-cluster the pixel vectors. Then (9.9) is used to recalculate the cluster centers.
This procedure is iterated until the centers cease to change significantly. K-means clustering
may be performed within the ENVI environment from the main menu.
82
9.2.2
Extended K-means
Denote by pk = p(Ck ) the prior probability for cluster k. The entropy S associated with
this prior distribution is
K
X
S=
pk log pk .
(9.13)
k=1
Distributions with high entropy are those for which the pi are all similar, that is, the pixels
are distributed evenly over all available clusters, see [Bis95]. Low entropy means that most
of the data are concentrated in very few clusters. We choose a prior distribution p(C) in
(9.11) for which few clusters are more probable than many clusters, namely
p(C) exp(E S) = exp E
K
X
pk log pk ,
k=1
K X
n
X
X
(xi mk )> (xi mk )
E
pk log pk .
2
2
K
uki
k=1 i=1
(9.14)
k=1
With
nk
1X
=
uki
n
n i=1
n
pk
(9.15)
this becomes
E(C) =
K X
n
X
uki
k=1 i=1
(xi mk )> (xi mk ) E
log pk .
2 2
n
(9.16)
An estimate for the parameter E may be obtained as follows [Pal98]: From (9.14) and
(9.15)
K
X
nk2 pk
E(C)
p
log
p
E k
k .
2 2
k=1
Equating the likelihood and prior terms in this expression and taking k2 2 and pk 1/K,
2 log(1/K)
The parameter 2 can be estimated from the data.
The extended K-means (EKM) algorithm is as follows: First an initial configuration with
a very large number of clusters K is chosen (for one-dimensional data this might conveniently
be the 256 gray values that a pixel with 8-bit resolution can have) and initial values
mk =
n
1 X
uki xi ,
nk i=1
pk =
nk
n
(9.18)
are determined. Then the data are re-clustered according to the distance measure corresponding to a minimization of (9.16):
d(i, k) =
log pk .
2 2
n
(9.19)
83
The prior term tends to reduce the number of clusters and any class which has in the course
of the algorithm nk = 0 is simply dropped from the calculation. (Condition (9.8) is thus
relaxed.) Iteration of (9.18) and (9.19) continues until no significant changes in the mk
occur.
The explicit choice of the number of clusters K is replaced by the necessity of choosing a
value for the meta-parameter E . This has the advantage that we can use one parameter
for a wide variety of images and let the algorithm itself decide on the actual value of K in
any given instance.
9.2.3
The agglomerative hierarchical clustering algorithm that we consider here is, as for K-means,
based on the cost function (9.12), see [DH73]. It begins by assigning each pixel in the dataset
to its own class or cluster. At this stage of course, the cost function E(C), Eq. (9.12), is
zero. We write E(C) in the form
K
X
Ek
(9.20)
E(C) =
k=1
where Ek is given by
Ek =
iCk
Every agglomeration of clusters to form a smaller number of clusters will increase E(C).
We therefore seek a prescription for choosing two clusters for combination that will increase
E(C) by the smallest amount possible.
Suppose clusters k with nk members and ` with n` members are merged, k < `, and the
new cluster is labeled k. Then
mk
n k mk + n ` m `
=: m.
n k + n`
Ek =
iCk C`
and E` disappears. The net change in E(C) is therefore, after some algebra,
X
X
(k, `) =
(xi m)
> (xi m)
(xi mk )> (xi mk )
iCk C`
iCk
>
(xi m` ) (xi m` )
(9.21)
iC`
nk n`
(mk m` )> (mk m` ).
nk + n`
The minimum increase in E(C) is achieved by combining those two clusters k and ` which
minimize the above expression. Given two alternative candidate cluster pairs with similar combined memberships nk + n` and whose means have similar euclidean separations
kmk m` k, this prescription obviously favors combining that pair with the larger discrepancy between nk and n` . Thus similar-sized clusters are preserved and smaller clusters are
absorbed by larger ones.
84
Let hk, `i represent the cluster formed by combination of the clusters k and `. Then the
increase in cost incurred by combining this cluster with cluster r can be determined from
(9.21) as
(nk + nr )(k, r) + (n` + nr )(`, r) nr (k, `)
.
(9.22)
(hk, `i, r) =
nk + n` + nr
Once
1
(xi xj )> (xi xj )
2
for i, j = 1 . . . n has been initialized from (9.21) for all possible combinations of pixels, the
recursive formula (9.22) can be used to calculate efficiently the cost function for any further
combinations without reference to the original data.
The algorithm terminates when the desired number of clusters has been reached or
continues until a single cluster has been formed. Assuming that the data consist of K
compact and well separated clusters, the slope of E(C) vs. the number of clusters K should
9.2.4
Fuzzy K-means
For q > 1 we write (9.9) and (9.12) in the equivalent forms [Dun73]
Pn
uqki xi
mk = Pi=1
k = 1 . . . K,
n
q ,
i=1 uki
E(C) =
K X
n
X
(9.23)
(9.24)
k=1 i=1
and make the transition from hard to fuzzy clustering by replacing (9.6) by continuous
variables
0 < uki < 1, k = 1 . . . K, i = 1 . . . n,
(9.25)
but retaining requirements (9.7) and (9.8). The matrix u is now a fuzzy class membership
matrix.
With i fixed, we seek values for the uki which solve the minimization problem
Ei =
K
X
i = 1 . . . n,
k=1
Li
= q(uki )q1 (xi mk )> (xi mk ) = 0,
uki
k = 1 . . . K,
9.3. EM CLUSTERING
85
s
q1
uki =
q1
1
.
(xi mk )> (xi mk )
(9.26)
s
q1
k=1
1
,
(xi mk )> (xi mk )
uki = P
K
k0 =1
1
(xi mk )> (xi mk )
q1
k = 1 . . . K, i = 1 . . . n.
(9.27)
1
(xi mk
0 )> (x
i mk0 )
9.3
EM Clustering
The EM (= expectation maximization) algorithm, (see e.g. [Bis95]) replaces uki in (9.27)
by the posterior probability p(Ck | xi ) of class Ck given the observation xi . That is, using
Bayes theorem,
uki p(Ck | xi ) p(xi | Ck )p(Ck ).
Here p(xi | Ck ) is taken to be a multivariate normal distribution function with estimated
mean mk and estimated covariance matrix Fk given by (9.9) and (9.10), respectively. Thus
1
1
uki p(Ck ) p
exp (xi mk )> F1
(x
m
)
.
(9.28)
i
k
k
2
|Fk |
One can use the current class membership to estimate P (Ck ) as pk according to (9.15).
The EM algorithm is then an iteration of equations (9.9), (9.10), (9.15) and (9.28) with
the same termination condition as for the fuzzy K-means algorithm, see also Eqs. (8.12).
After each iteration the columns of u are normalized according to (9.7). Because of the
exponential distance dependence of the membership probabilities in (9.28), the algorithm
is very sensitive to initialization conditions, and can even become unstable. To avoid this
problem, one can first obtain initial values for the mk and for u by preceding the calculation
with the fuzzy K-means algorithm. Explicitly:
Algorithm (EM clustering)
1. Determine starting values for cluster centers mk and initial memberships uki from
the FKM algorithm.
86
9.3.1
Simulated annealing
Even with initialization using the fuzzy K-means algorithm the EM algorithm may be
trapped in a local optimum. An alternative scheme is to apply so-called simulated annealing.
Essentially the initial memberships are random and only gradually are the calculated class
memberships allowed to influence the estimation of the class centers [Hil01]. The rate of
reduction of randomness is determined by a temperature parameter. For example, the class
memberships in (9.28) may replaced by
uki uki (1 r1/T )
on each iteration, where T is initialized to T0 and reduced at each iteration by a factor c < 1:
T cT
and where r (0, 1) is a uniformly distributed random number. As T approaches zero,
uki will be determined more and more by the probability distribution parameters alone in
(9.28).
9.3.2
Partition density
Since the simple cost function E(C) of (9.12) is no longer relevant, we choose with [GG89]
the partition density as a criterion for choosing the best number of clusters. The fuzzy
hypervolume, defined as
K p
X
F HV =
|Fk |,
k=1
is proportional to the volume in feature space occupied by the ellipsoidal clusters generated
by the algorithm. For instance, for a two dimensional cluster with an elliptical probability
density we have, in its principal axis coordinate system,
s
p
12 0
= 1 2 area (volume) of the ellipse.
|| =
0 22
Summing the memberships of the observations within one standard deviation of each cluster
center,
n X
K
X
S=
uik , i {i | (xi mk )> F1
k (xi mk ) < 1},
i=1 k=1
(9.29)
ate normally distributed pixels, the partition density should exhibit a maximum at K = K.
An ENVI plug-in for EM clustering is given in Appendix D.6.3.
9.3. EM CLUSTERING
9.3.3
87
The algorithms described thus far make exclusive use of the spectral properties of the individual observations (pixels). Spatial relationships within an image such as large scale,
coherent regions, textures etc. are ignored entirely.
The EM algorithm determines the a posteriori class membership probabilities of each
observation for the classes in question. In this section we describe a post-processing technique
to take account of some of the spatial information implicit in the classified image in order to
improve the original classification. This technique makes use of the vectors of a posteriori
probabilities associated with each classified pixel.
Figure 9.1 shows schematically a single pixel m together with its immediate neighborhood
n, which we take to consist of the four pixels above, below, to the left and to the right of
m. Let its a posteriori probabilities be
Pm (Ck ),
k = 1 . . . M,
M
X
Pm (Ck ) = 1,
k=1
+
2
1
k = 1 . . . M,
Pm Qm
,
P>
m Qm
(9.30)
88
that
M
X
0
Pm
(Ck ) = 1.
k=1
The neighborhood function must somehow reflect the spatial structure of the image. In
order to define it we first postulate a compatibility measure
Pmi (Ck |Cl ),
i = 1 . . . 4,
namely, the conditional probability that pixel m belongs to class Ck , given that the neighboring pixel i, i = 1 . . . 4, belongs to Cl . A small piece of evidence that m should be
classified to Ck would then be
Pmi (Ck |Cl )Pi (Cl ),
i = 1 . . . 4,
that is, the conditional probability that pixel m is in class Ck if neighboring pixel i is in
class Cl , i = 1 . . . 4.
We obtain a Neighborhood function Qm (Ck ) by summing over all pieces of evidence:
1 XX
Pmi (Ck |Cl )Pi (Cl )
4 i=1
4
Qm (Ck ) =
l=1
M
X
(9.31)
l=1
Pn (Cl ) =
and where Pmn (Ck |Cl ) also corresponds to the average compatibility of pixel m with its
entire neighborhood. We can write (9.31) again as a vector equation,
Qm = Pmn Pn
and (9.30) finally as
P0m =
Pm (Pmn Pn )
.
P>
m Pmn Pn
(9.32)
The matrix of average compatibilities Pmn can be estimated directly from the original
classified image. A random central pixel m is chosen and its calss Ci determined. Then, again
randomly, a pixel j out of its neighborhood its chosen and its class Cj is also determined.
Thereupon the matrix element Pmn (Ci |Cj ) (which was initialized to 0) is incremented by 1.
This is repeated many times and finally the rows of the matrix are normalized.
Equation (10.15) is well-suited for a simple algorithm:
Algorithm (Probabilistic label relaxation)
1. Carry out a supervised classification, e.g. with a FFN, and determine the compatibility matrix Pmn .
89
9.4
The Kohonen self organizing map, a simple example of which is sketched in Fig. 9.2 , belongs
to a class of neural networks which are trained by competitive learning, [HKP91, Koh89].
The single layer of neurons can have any geometry, usually one- two- or three-dimensional.
The input signal is represented by the vector
x = (x1 , x2 . . . xN )> .
Each input to a neuron is associated with a synaptic weight, so that for M neurons, the
synaptic weights can represented as a (M N ) matrix
w11
w21
w=
...
w12
w22
..
.
..
.
w1N
w2N
.
..
.
wM 1
wM 2
wM N
The components of the vector wk = (wk1 , wk2 . . . wkN )> are thus the synaptic weights of
the kth neuron.
We interpret the vectors
{x(i)|i = 1 . . . p}.
as training data for the neural network. The synaptic weight vectors are to be adjusted so
as to reflect in some way the clustering of the training data in the N -dimensional feature
space.
When a training vector x is presented to the input of the network, the neuron whose
weight vector wk lies nearest to x is designated to be the winner. Distances are given by
(x wk )> (x wk ).
Call the winner k . Then its weight vector is shifted a small amount in the direction of the
training vector:
wk (i + 1) = wk (i) + (x(i) wk (i)),
where wk (i + 1) is the weight vector after presentation of the ith training vector, see Fig.
9.3. The parameter is called the learning rate of the network.
90
Figure 9.2: The Kohonen feature map in two dimensions with a two-dimensional input.
The intention is to repeat this learning procedure until the synaptic weight vectors reflect
the class structure of the training data, thus achieving a vector quantization of the feature
space. In order for this method to function, it is necessary to allow the learning rate to
decrease gradually during the training process. A convenient function for this is
i/p
min
(i) = max
.
max
However the Kohonen feature map goes a step further and tries to map the topology of the
feature space onto the network. This is achieved by defining a neighborhood function for the
winner neuron on the network of neurons. Usually a Gauss function of the form
(k , k) = exp(d2 (k , k)/2 2 )
is used, where d2 (k , k) is the square of the distance between neurons k and k. For example,
for a two-dimensional array of m m neurons
d2 (k , k) =[(k 1) mod m (k 1) mod m]2
+ [(k 1) div m (k 1) div m]2 ,
whereas for a cubic m m m array.
d2 (k , k) = [(k 1) mod m (k 1) mod m]2
+ [((k 1) div m (k 1) div m) mod m]2
+ [(k 1) div m2 (k 1) div m2 ]2 .
During the learning phase not only the winner neuron, but also the neurons in its neighborhood are moved in the direction of the training vectors:
wk (i + 1) = wk (i) + (i)(k , k)(x(i) wk (i)), k = 1 . . . M.
91
wk (i)
7
N
>
wk (i + 1)
:
x(i)
Figure 9.3: Movement of synaptic weight vector in the direction of training vector.
9.5
We mention finally an extension of the procedure used to determine change/no-change decision thresholds discussed in Section 8.4.7. Rather than clustering the MAD change components individually as was done there, we can use any of the algorithms introduced in
this chapter (except the Kohonen SOM) to classify the changes. Because of its ability to
accommodate correlated clusters, we prefer the EM algorithm.
Clustering of the change pixels can of course be applied in the full MAD or MNF/MAD
feature space, where the number of clusters chosen determines the number of change categories. The approximate chi-square distribution of the sum of squares of the standardized
variates allows the labelling of pixels with high no-change probability. These can be excluded from the clustering process e.g. by freezing their a posteriori probabilities to 1
for the no-change class, thereby speeding up the calculation considerably. Routines for
change classification using the EM algorithm are included in the ENVI GUI for
viewing change detection images given in Appendix D.6.6.
92
Chapter 10
Supervised Classification
The pixel-oriented, supervised classification of multispectral images is a problem of probability density estimation. On the basis of representative training data for each class, the
probability distributions for all of the classes are estimated and then used to classify all of
the pixels in the image. We will consider three methods or models for supervised classification: a parametric model (Bayes maximum likelihood), a non-parametric model (Gaussian
kernel) and a mixture model (the feed-forward neural network). The basis for all of these
classifiers is Bayes decision rule, which we consider first.
10.1
The a posteriori probabilities for class Ck , Eq. (2.3), can be written for N -diminsional
training data and M classes in the form
P (Ck |x),
k = 1 . . . M, x = (x1 . . . xN )> .
(10.1)
Let us define a loss function L(Ci , x) which measures the cost of associating the pixel with
feature vector x with the class Ci . Let ik be the loss incurred if x in fact belongs to class
Ci , but is classified as belonging to class Ck . We can reasonably assume
= 0 if i = k
ik
i, k = 1 . . . M,
(10.2)
> 0 otherwise,
that is, a correct classification incurs no loss. We can now express the loss function as a sum
over the individual losses, weighted according to (10.1):
L(Ci , x) =
M
X
ik P (Ck |x).
(10.3)
k=1
Without further specifying ik , we can define a loss-minimizing decision rule for our classification as
x Ci provided L(Ci , x) < L(Cj , x) for all j = 1 . . . M, j 6= i.
(10.4)
Up till now weve been completely general. Now suppose the losses are independent of the
kind of misclassification that occurs (for instance, the classification of a forest pixel into
93
94
the the class meadow is just as bad as classifying it as urban area, etc). The we can write
ik = 1 ik , .
Thus any given misclassification (i 6= k) costs unity, and a correct classification (i = k) costs
nothing. We then obtain from (10.3)
L(Ci , x) =
M
X
(10.5)
k=1
10.2
(10.6)
Training data
The selection of representative training data is the most difficult and critical part of the
classification process. The standard procedure is to select training areas within the image
which are representative of each class of interest. In the ENVI environment, these are
entered as regions of interest (ROIs), from which the training pixel vectors are generated.
Note that some fraction of the representative data must be withheld for later accuracy
assessment. These are the so-called test data, which are not used for training purposes in
order not to bias the accuracy assessment. Well discuss their use in detail in later in this
chapter.
Suppose there are just two classes, that is M = 2. If we apply decision rule (10.6) to
some measured pixel vector x, the probability of incorrectly classifying the pixel is
r(x) = min[P (C1 |x), P (C2 |x)].
The Bayes error is defined to be the average of r(x) over all pixels,
Z
Z
= r(x)p(x)dx = min[P (C1 |x), P (C2 |x)]p(x)dx
Z
= min[P (x|C1 )P (C1 ), P (x|C2 )P (C2 )]dx,
where we used Bayes rule in the last step. We can use the Bayes error as a measure of the
separability of the two classes, the smaller the error, the better the separability.
Calculating the Bayes error is difficult, but we can at least get an approximate upper
bound as follows. First note that, for any a, b 0,
min[a, b] aS b1S ,
0 S 1.
95
The best upper bound is then determined by minimizing u with respect to S. If we assume
that P (x|C1 ) and P (x|C2 ) are normal distributions with 1 = 2 , then the minimum occurs
at S = 1/2.
We get the Bhattacharyya bound B by using S = 1/2 also for the case where 1 6= 2 :
Z p
p
B = P (C1 )P (C2 )
P (x|C1 )P (x|C2 ) dx.
This integral can be evaluated explicitly. The result is
p
B = P (C1 )P (C2 )eB ,
where B is the Bhattacharyya distance given by
!
,
1
1 + 2 p
1
1 + 2
1
>
(2 1 ) + log
|1 ||2 | .
B = (2 1 )
8
2
2
2
The first term is an average Mahalinobis distance (see below), the second term depends
on the difference between the covariance matrices of the two classes. It vanishes when
1 = 2 . Thus the first term gives the class separability due due the distance between
the class means, while the second term gives the separability due to the difference in the
covariance matrices.
Finally, the Jeffries-Matusita distance measures separability of two classes on a scale
[0 2] in terms of B:
J = 2(1 eB ).
(10.7)
The ENVI menu command
Basic Tools/Region of Interest/Compute ROI Separability
calculates Jeffries-Matusita distances between all pairs of classes defined by a given set of
ROIs.
10.3
P (x|Ci )P (Ci )
P (x)
M
X
P (x|Cj )P (Cj ).
j=1
(x
(x
)
.
(10.8)
i
i
i
2
(2)N/2 |i |1/2
96
According to the first assumption, we only need to associate x to that class Ci which
maximizes P (x|Ci ):
x Ci if P (x|Ci ) > P (x|Cj ) for all j = 1 . . . M, j 6= i.
(10.9)
(10.10)
(10.11)
1 X
i Fi =
(x i )(x i )> ,
ni
xCi
10.4
Non-parametric methods
In non-parametric density estimation we wish to model the probability distribution generated by a given set of training data, without making any prior assumption about the form of
the distribution function. An example is the class of kernel based methods. Here each data
point is used as the center of a simple local probability density and the overall distribution
is taken to be the sum of the local distributions. In N dimensions, we can model the class
probability distribution as
P (y|Ci )
>
2
1 X
1
e(yx) (yx)/2 .
2
N/2
ni
(2 )
xC
i
The quantity is a smoothing parameter which we can choose for example by minimizing
the misclassifications on the training data themselves with respect to .
The kernel based method suffers from the drawback of requiring all training data points
to be stored. This makes the evaluation of the density very slow if the number of training
pixels is large. In general, the complexity grows with the amount of data, not with the
difficulty of the estimation problem itself.
10.5
97
Neural networks
Neural networks belong to the category of mixture models for probability density estimation,
which lie somewhere between the parametric and non-parametric extremes. They make no
assumption about the functional form of the probabilities and can be adjusted flexibly to
the complexity of the system that they are being used to model.
To motivate their use for classification, consider two classes C1 and C2 in a two-dimensional
feature sspace. We could write (10.11) in the form of a discriminant
m(x) = d1 (x) d2 (x)
and say that
x is
C1
C2
if m(x) > 0
if m(x) < 0.
(10.12)
where w = (w1 , w2 )> and w0 are parameters. The decision boundary occurs for m(x) = 0,
i.e. for
w1
w0
x2 = x1
,
w2
w2
see Figure 10.1
w0
w
2
u
u e u
u
e
u
e
m(x) = 0
e
e e
e
e w1
e
e
e
w2
98
1
x1
xi
xN
- 0
w0
- 1
w1
..
.
wi
- i
wN
..
.
- N
~
m(x)
q
:
>
Figure 10.2: An artificial neuron. The first input is always unity and is called the bias.
where
I(x) = w> x + w0 .
This is sometimes justified by the analogy to biological neurons. In IDL (see Figure 10.3):
thisDevice =!D.Name
set_plot, PS
Device, Filename=c:\temp\logistic.eps,xsize=15,ysize=10,/Encapsulated
x=(findgen(100)-50)/10
plot, x,1/(1+exp(-x))
device,/close_file
set_plot,thisDevice
99
There is also a statistical justification, however [Bis95]. Suppose two classes in twodimensional feature space are normally distributed with 1 = 2 = I,
P (x|Ck )
|x k |2
1
exp(
),
2
2
k = 1, 2.
Then we have
P (x|C1 )P (C1 )
P (x|C1 )P (C1 ) + P (x|C2 )P (C2 )
1
=
1 + P (x|C2 )P (C2 )/P (x|C1 )P (C1 )
1
=
.
1
2
1 + exp( 2 [(x 2 ) (x 1 )2 ])(P (C2 )/P (C1 ))
P (C1 |x) =
we get
1
1+
2 |2 |x 1 |2 ] a)
1
1
=
=
>
1 + exp(w x w0 )
1 + eI(x)
= m(x).
P (C1 |x) =
exp( 12 [|x
10.5.1
In order to discriminate any number of classes, multilayer feed-forward networks are often
used, see Figure 10.4. In this figure, the input signal is the N + 1-component vector
x(`) = (1, x1 (`) . . . xN (`))>
for training sample `, which is fed simultaneously to the so-called hidden layer consisting of
L neurons. These in turn determine the L + 1-component vector
n(x) = (1, n1 (x) . . . nL (x))>
according to
nj (x) = g(Ijh (x)),
j = 1 . . . L,
with
Ijh (x) = wjh> x,
where wh> is the hidden weight vector for the jth neuron
h >
wh = (w0h , w1h . . . wL
) .
100
#
-
"!
#
#
1
"!
#
x1 (`)
xN (`)
- 1
*
"!
i
"!
..
.
#
N
"!
n1
#
q
- m1 (`)
1
>
"!
..
.
..
.
"!
..
.
#
xi (`)
Wo
#
~
q
: j
>"!
#
w
nj R
- k
- mk (`)
>
"!
..
.
..
.
#
w
s L
"!
nL
#
U
- mM (`)
RU M
"!
Figure 10.4: A two-layer, feed-forward neural network with L hidden neurons for classification of N -dimensional data into M classes.
n=
o
Wo = (w1o , w2o , . . . wM
),
1
g(Wh> x)
.
The vector n is then fed to the output layer. If we interpret the outputs as probabilities,
then we must ensure that
0 mk 1, k = 1 . . . M,
and, furthermore, that
M
X
mk = 1.
k=1
This can be done by using a modified logistic activation function for the output neurons,
called softmax:
o
eIk (n)
,
mk (n) = I o (n)
o
o
e1
+ eI2 (n) + . . . + eIM (n)
where
Iko (n) = wko> n.
101
10.5.2
Cost functions
We havent yet considered the correct choice of synaptic weights. This procedure is called
training the network. The training data can be represented as the set of labelled pairs
{(x(`), y(`)) | ` = 1 . . . p},
where
y(`) = (0, 0 . . . 0, 1, 0 . . . 0)>
is an M -dimensional vector of zeroes, with a 1 at the kth position to indicate that x(`)
belongs to class Ck . An intuitive training criterion is then the quadratic cost function
1X
ky(`) m(`)k2 .
2
p
E(Wh , Wo ) =
(10.13)
`=1
1
ky(`) m(`)k2 ,
2
` = 1 . . . p.
(10.14)
An alternative cost function can be obtained with the following argument: Choose the
synaptic weights so as to maximize the probability of observing the training data:
P (x(`), y(`)) = P (y(`) | x(`))P (x(`)) max .
The neural network predicts the posterior class membership probability, which we can write
as
M
Y
P (y(`) | x(`)) =
[ mk (x(`)) ]yk (`) .
k=1
For example:
P ((1, 0 . . . 0)> |x) = m1 (x)1 m2 (x)0 smM (x)0 = m1 (x).
Therefore we wish to maximize
M
Y
k=1
102
Taking logarithms, dropping terms which are independent of the synaptic weights and summing over all of the training data, we see that this is equivalent to minimizing the cross
entropy cost function
E(W , W ) =
h
p X
M
X
(10.15)
`=1 k=1
10.5.3
Training
dE(w ) 1
d2 E(w )
+ (w w )2
+ ...
dw
2
dw2
1
= E0 + (w w )2 H + . . . ,
2
where H =
d2 E(w )
dw2
dE(w )
dw
=0
w
2 E(w )
.
wi wj
(10.16)
103
It is symmetric and it must be positive definite for a local minimum. It is positive definite
if all of its eigenvalues are positive, see Appendix C.
A local minimum can be found with various search algorithms. Backpropagation is the
most well-known and extensively used method and is described below. It is used in the standard ENVI neural network for supervised classification. However much better algorithms
exist, such as scaled conjugate gradient or Kalman filter. These are discussed in detail in
Appendix C. ENVI plug-ins for supervised classification with a feed forward neural network trained with conjugate gradient and a fast Kalman filter algorithm
are given in Appendices D.7 and D.8.
10.5.4
Backpropagation
We will develop a training algorithm for the two-layer, feed-forward neural network of Figure
10.4. Our starting point is the local version of the cost function (10.15),
E(`) =
M
X
` = 1 . . . p,
(10.17)
k=1
which we wish to minimize with respect to the synaptic weights represented by the (N +1)L
h
o
matrix Wh = (w1h , w2h , . . . wL
) and the (L + 1) M matrix Wo = (w1o , w2o , . . . wM
).
The following IDL object class FFN mirrors the network architecture of Figure 10.4 and
will form the basis for the implementation of the training algorithms developed here and in
Appendix C:
;+
; NAME:
;
FFN__DEFINE
; PURPOSE:
;
Object class for implementation of a two-layer, feed-forward
;
neural network for classification of multi-spectral images.
;
This is a generic class with no training methods.
;
Ref: M. Canty, Fernerkundung mit neuronalen Netzen, Expert 1999
; AUTHOR
;
Mort Canty (2005)
;
Juelich Research Center
;
m.canty@fz-juelich.de
; CALLING SEQUENCE:
ffn = Obj_New("FFN",Xs,Ys,L)
;
; ARGUMENTS:
;
Xs: array of observation column vectors
;
Ys: array of class label column vectors of form (0,0,1,0,0,...0)^T
;
L:
number of hidden neurons
; KEYWORDS
;
None
; METHODS (external):
;
FORWARDPASS: propagate a biased input column vector through the network
104
;
returns the softmax probabilities vector
;
m = ffn -> ForwardPass()
;
CLASS: return the class for an for an array of observation column vectors X
;
return the class probabilities in array variable PROBS
;
c = ffn -> Class(X,Probs)
COST: return the current cross entropy
;
;
c = ffn -> Cost()
; DEPENDENCIES:
;
None
;--------------------------------------------------------------
105
End
Function FFN:: class, X, Probs
; vectorized class membership probabilities
nx = n_elements(X[*,0])
Ones = fltarr(nx) + 1.0
N = [[Ones],[1/(1+exp(-transpose(*self.Wh)##[[Ones],[X]]))]]
Io = transpose(*self.Wo)##N
maxIo = max(Io,dimension=2)
for k=0,self.MM-1 do Io[*,k]=Io[*,k]-maxIo
A = exp(Io)
sum = total(A,2)
Probs = fltarr(nx,self.MM)
for k=0,self.MM-1 do Probs[*,k] = A[*,k]/sum
; vectorized class memberships
maxM = max(Probs,dimension=2)
M=fltarr(self.MM,nx)
for i=0,self.MM-1 do M[i,*]=Probs[*,i]-maxM
return, byte((where(M eq 0.0) mod self.MM)+1)
End
Function FFN:: cost
Ones = fltarr(self.p) + 1.0
N = [[Ones],[1/(1+exp(-transpose(*self.Wh)##[*self.Xs]))]]
Io = transpose(*self.Wo)##N
maxIo = max(Io,dimension=2)
for k=0,self.MM-1 do Io[*,k]=Io[*,k]-maxIo
A = exp(Io)
sum = total(A,2)
Ms = fltarr(self.p,self.MM)
for k=0,self.MM-1 do Ms[*,k] = A[*,k]/sum
return, -total((*self.Ys)*alog(Ms))
End
Pro FFN__Define
struct = { FFN, $
NN: 0L,
LL: 0L,
MM: 0L,
Wh:ptr_new(),
Wo:ptr_new(),
Xs:ptr_new(),
Ys:ptr_new(),
N:ptr_new(),
p: 0L
}
End
$
$
$
$
$
$
$
$
$
;input dimension
;number of hidden units
;output dimension
;hidden weights
;output weights
;training pairs
;output vector from hidden layer
;number of training pairs
106
h
wij
with
h
wij
E(`)
h .
wij
eIk (`)
mk (`) = I o (`)
,
o
o
e 1 + eI2 (`) + . . . + eIM (`)
(10.18)
E(`)
o ,
wjk
j = 0 . . . L, k = 1 . . . M.
k = 1 . . . M,
(10.19)
E(`)
Iko (`)
and is the negative rate of change of the local cost function with respect to the activation
of the kth output neuron. Again applying the chain rule and with (10.16) and (10.18),
M
X
E(`)
E(`) mk0 (`)
=
o
Iko (`)
m
k0 (`) Ik (`)
0
k =1
M
X
k0 =1
yk0 (`)
mk0 (`)
!
.
107
0 if k =
6 k0
1 if k = k 0 .
Continuing,
M
M
X
X
yk0 (`)
E(`)
0
0
=
m
(`)(
m
(`))
=
y
(`)
+
m
(`)
yk0 (`).
k
kk
k
k
k
Iko (`)
mk0 (`)
0
0
k =1
k =1
(10.20)
Thus from (10.19) the third step in the backpropagation algorithm can be written in the
form
Wo (` + 1) Wo (`) + n(`) o> (`).
(10.21)
Note that the second term on the right hand side of (10.21) is an outer product, giving a
matrix of dimension (L + 1) M matching that of Wo .
For the hidden weights, step 4 of the algorithm, we proceed similarly:
E(`) Ijh (`)
E(`)
=
= jh (`)x(`),
h
Wj
Ijh (`) Wjh
j = 1 . . . L,
(10.22)
where jh (`) is the negative rate of change of the local cost function with respect to the
activation of the jth hidden neuron:
jh (`) =
E(`)
.
Ijh (`)
M
M
M
X
E(`) Iko (`) X o Iko (`) X o wko> n(`)
=
(`)
=
k (`)
.
k
Ik0 (`) Ijh (`)
Ijh (`)
Ijh (`)
k=1
k=1
k=1
Since only the output of the jth hidden neuron is a function of Ijh (`) = wjh> x(`), we have
jh (`) =
M
X
o
ko (`)wjk
k=1
nj (`)
.
Ijh (`)
1
h
1 + eIj
108
for which
dnj
= n(x)(1 n(x)).
dx
M
X
o
ko (`)wjk
nj (`)(1 nj (`)),
k=1
(10.23)
Note that the fact that 1 n0 (`) = 0 is made explicit in the above expression. Equation
(10.23) is the origin of the term backpropagation, since it propagates the output error o
backwards through the network to determine the hidden unit error h .
Finally, with (10.22) we obtain the update rule for step 4 of the backpropagation algorithm,
Wh (` + 1) Wh (`) + x(`) h> (`).
(10.24)
The choice of an appropriate learning rate is problematic: small values imply slow
convergence and large values produce oscillation. Some improvement can be achieved with
an additional parameter called momentum. We replace (10.21) with
Wo (` + 1) := Wo (`) + o (`) + o (` 1),
(10.25)
where
o (`) = n(`) o> (`),
and is the momentum parameter. A similar expression replaces (10.24). Typical choices
for the backpropagation parameters are = 0.01 and = 0.5.
Here is an object class extending FFN which implements backpropagation:
;+
; NAME:
;
FFNBP__DEFINE
; PURPOSE:
;
Object class for implementation of a two-layer, feed-forward
;
neural network for classification of multi-spectral images.
;
Implements ordinary backpropagation training.
;
Extends the class FFN
;
Ref: M. Canty, Fernerkundung mit neuronalen Netzen, Expert 1999
; AUTHOR
;
Mort Canty (2005)
;
Juelich Research Center
;
m.canty@fz-juelich.de
109
; CALLING SEQUENCE:
;
ffn = Obj_New("FFNBP",Xs,Ys,L)
; ARGUMENTS:
;
Xs: array of observation column vectors
;
Xs: array of class label column vectors of form (0,0,1,0,0,...0)^T
L:
number of hidden neurons
;
; KEYWORDS
;
None
; METHODS:
;
TRAIN: train the network
;
ffn -> train
; DEPENDENCIES:
;
FFN__DEFINE
;
PROGRESSBAR (FSC_COLOR)
;-------------------------------------------------------------Function FFNBP::Init, Xs, Ys, L
catch, theError
if theError ne 0 then begin
catch, /cancel
ok = dialog_message(!Error_State.Msg + Returning..., /error)
return, 0
endif
; initialize the superclass
if not self->FFN::Init(Xs, Ys, L) then return, 0
self.iterations = 10*self.p
self.cost_array = ptr_new(fltarr((self.iterations+100)/100))
return, 1
End
Pro FFNBP::Cleanup
ptr_free, self.cost_array
self->FFN::Cleanup
End
Pro FFNBP::Train
iter = 0L
iter100 = 0L
eta = 0.01
; learn rate
alpha = 0.5 ; momentum
progressbar = Obj_New(progressbar, Color=blue, Text=0,$
title=Training: exemplar number...,xsize=250,ysize=20)
progressbar->start
window,12,xsize=400,ysize=400,title=Cost Function
wset,12
inc_o1 = 0
inc_h1 = 0
repeat begin
if progressbar->CheckCancel() then begin
print,Training interrupted
110
progressbar->Destroy
return
endif
; select exemplar pair at random
ell = long(self.p*randomu(seed))
x=(*self.Xs)[ell,*]
y=(*self.Ys)[ell,*]
; send it through the network
m=self->forwardPass(x)
; determine the deltas
d_o = y - m
d_h = (*self.N*(1-*self.N)*(*self.Wo##d_o))[1:self.LL] ; d_h is now a row vector
; update the synaptic weights
inc_o = eta*(*self.N##transpose(d_o))
inc_h = eta*(x##d_h)
*self.Wo = *self.Wo + inc_o + alpha*inc_o1
*self.Wh = *self.Wh + inc_h + alpha*inc_h1
inc_o1 = inc_o
inc_h1 = inc_h
; record cost history
if iter mod 100 eq 0 then begin
(*self.cost_array)[iter100]=alog10(self->cost())
iter100 = iter100+1
progressbar->Update,iter*100/self.iterations,text=strtrim(iter,2)
plot,*self.cost_array,xrange=[0,iter100],color=0,background=FFFFFFXL,$
xtitle=Iterations/100),ytitle=log(cross entropy)
end
iter=iter+1
endrep until iter eq self.iterations
progressbar->destroy
End
Pro FFNBP__Define
struct = { FFNBP, $
cost_array: ptr_new(), $
iterations: 0L, $
Inherits FFN $
}
End
In the Train method, the training pairs are chosen at random, rather than cyclically as
indicated in the backpropagation Algorithm.
10.6
Evaluation
The rate of misclassification offers us a reasonable and obvious basis not only for evaluating
the quality of classifiers, but also for their comparison, for example to compare the feedforward network with Bayes maximum-likelihood. We shall characterize this rate in the
following with the parameter . Through classification of test data which have not been
10.6. EVALUATION
111
used for training, we can obtain unbiased estimates of . If, for n test data, y are found to
have been misclassified, then an intuitive value for this estimate is
=: .
n
(10.26)
However the estimated misclassification rates alone are insufficient for model comparison.
We require their uncertainties as well.
10.6.1
The classification of a single test datum is a random experiment, whose possible result we
A}: A=
misclassified, A = correctly classified. We define a
can characterize as the set {A,
real-valued function on this set, i.e. a random variable
= 1,
X(A)
X(A) = 0,
(10.27)
with probabilities
P (X = 1) = = 1 P (X = 0).
The expectation value of this random variable is
hXi = 1 + 0(1 ) =
(10.28)
var(X) = hX 2 i hXi2 = 12 + 02 (1 ) 2 = (1 ).
(10.29)
For the classification of n test data, denoted by random variables X1 . . . Xn , the random
variable
(10.30)
Y = X1 + X2 + . . . Xn
is clearly the associated number of misclassifications. Since
hY i = hX1 i + . . . + hXn i = n
we obtain
y
1
= hY i =
n
n
as an unbiased estimate of the rate of misclassifications.
From the independence of the Xi , i = 1 . . . n, the variance of Y is given by
(10.31)
= 2 (hY 2 i hY i2 ) = 2 var(Y ),
=
2
n
n
n
n
n
or
var
Y
n
=
(1 )
.
n
(10.32)
For y observed misclassifications we estimate with (10.31). Then the estimated variance
is given by
y
y
)
Y
y(n y)
(1
n 1 n
var
=
=
,
n
n
n
n3
112
y(n y)
.
n3
(10.33)
The random variable Y is binomially distributed. However for a sufficiently large number
n of test data, the binomial distribution is well-approximated by the normal distribution.
Mean and standard deviation are then sufficient to characterize the distribution function
completely.
10.6.2
Model comparison
A typical value for a misclassification rate is around 0.5. In order to claim that two
values differ from one another significantly, they should lie at least about two standard
deviations apart. If we wish to discriminate values separated by say 0.01, then
should be
no greater than 0.005. From (10.32) this means
0.0052
0.05(1 0.05)
,
n
or n 2000. Thats quite a few. However since we are dealing with pixel data, such a
number of test pixels assuming sufficient training areas are available is quite realistic.
If training and test data are in fact at a premium, there exist efficient alternatives1 to the
simple train-and-test philosophy presented here. However, since they are generally quite
computer-intensive, we wont consider them further.
In order to express the claim that classifier A is better than classifier B more precisely, we
can formulate an hypothesis test. The individual misclassification rates are approximately
normally distributed. If they are also independent we can construct a test statistic S given
by
YA /n YB /n + A B
YA /n YB /n + A B
S= p
=p
.
var(YA /n YB /n)
var(YA /n) + var(YB /n)
We can then use S to decide between the null hypothesis
H0 : A = B ,
Thus under H0 we have S N (0, 1). We choose a decision threshold Z/2 which corresponds to a probability of an error of the first kind. With this probability the null
hypothesis will be rejected although it is in fact true, see Figure 10.6.
In fact the strict independence of the misclassification rates A and B is not given, since
they are determined with the same set of test data. The above hypothesis test with the
statistic S is therefore too conservative. For dependence we have namely
var(YA /n YB /n) = var(YA /n) + var(YB /n) 2cov(YA /n, YB /n),
1 The buzz-words here are Cross-Validation and Bootstrapping, see [WK91], Chapter 2, for an excellent
introduction.
10.6. EVALUATION
113
(S)
acceptance region
Z/2
Z/2
w
/2
/2
Figure 10.6: Acceptance region for the first hypothesis test. If Z/2 S Z/2 , the null
hypothesis is accepted, otherwise it is rejected.
where the covariance term cov(YA /n, YB /n) is positive. The test statistic S is correspondingly underestimated.
We can formulate a non-parametric hypothesis test which avoids this problem of dependence. We distinguish the following events for classification of the test data:
AB
und AB.
AB,
AB,
is the event test observation is misclassified by A and correctly classified
The variable AB
is the event test observation is correctly classified by A and misclassified by
by B, while AB
B and so on. As before we define random variables:
XAB
, XA
, XA B
B
and XAB
where
= 1,
XAB
(AB)
= XAB
B)
= XAB
XAB
(AB)
(A
(AB) = 0,
with probabilities
P (XAB
= 1) = AB
= 1 P (XAB
= 0).
Corresponding definitions are made for XAB , XAB and XAB .
and AB.
If
Now, in comparing the two classifiers we are interested in the events AB
the number of former is significantly smaller than the number of the latter, then A is better
in which both methods perform poorly are excluded.
than B and vice versa. Events AB
For n test observations the random variables
YAB
= XAB
1 + . . . XAB
n
and
var(YAB
) = nAB
(1 AB
)
hYAB i = nAB ,
var(YAB ) = nAB (1 AB ).
114
We expect that AB
1, that is, var(YAB
) nAB
= hYAB
i. The same goes for YAB
. For
a sufficiently large number of test observationss, the random variables
YAB
hYAB
i
p
hYAB
i
and
YAB hYAB i
p
hYAB i
2
(YAB
(Y hY i)2
hY i)
+ AB
.
hY i
hY i
This statistic, being the sum squares of approximately normally distributed random variables, is chi-square distributed, see Chapter 2.
Let yAB
and yAB
be the number of events actually measured. Then we estimate hY i as
y + yAB
hY i = AB
2
and determine our test statistic as
yAB
+yAB
2
+yAB
2
(yAB
)
(yAB yAB
)
2
2
S =
+
.
yAB
+y
y
+y
AB
AB
AB
2
(y yAB )2
,
S = AB
yAB
+ y AB
(10.34)
the so-called McNemar statistic. It is chi-square distributed with one degree of freedom,
see for example [Sie65]. A so-called continuity correction is usually made to (10.34) and S
written as
2
(|yAB
| 1)
y AB
S =
.
yAB
+ y AB
But there are still reservations! We can only conclude that one classifier is or is not
superior, relative to the common set of training data. We havent taken into account the
variability of the training data, which were sampled just once from their underlying distributions, only that of the test data. If one or both of the classifiers is a neural network, we
have also not considered the variability of the neural network training procedure with respect to the random initialization of the synaptic weights. All this constitutes an extremely
computation-intensive task [Rip96].
10.6.3
Confusion matrices
c11 c12
c21 c22
C=
..
...
.
cM 1
cM 2
s
s
..
.
c1M
c2M
..
.
cM M
10.6. EVALUATION
115
where cij is the number of test pixels from class Ci which are classified as Cj . The misclassification rate is
PM
n i=1 cii
y
n Tr C
= =
=
n
n
n
and only takes into account of the diagonal elements of the confusion matrix.
The Kappa-coefficient make use of all the matrix elements. It is defined as follows:
=
For a purely randomly labeled test pixel, the proportion of correct classifications is approximately
M
X
ci ci
,
n2
i=1
where
ci =
M
X
cij ,
ci =
j=1
M
X
cji .
j=1
= i n P ciici n .
1 i n2
(10.35)
Again, the Kappa coefficient alone tells us little about the quality of the classifier. We
require its standard deviation. This can be calculated in the large sample limit n to
be [BFH75]
!
1 1 (1 1 ) 2(1 1 )(21 2 3 ) (1 1 )2 (4 422 )
=
+
+
,
(10.36)
n (1 2 )2
(1 3 )3
(1 2 )4
where
1 =
M
X
cii
i=1
2 =
M
X
ci ci
i=1
3 =
M
X
cii (ci + ci )
i=1
4 =
M
X
i,j=1
cij (cj + ci )2 .
116
Chapter 11
Hyperspectral analysis
Hyperspectral as opposed to multispectral images combine both high or moderate spatial
resolution with high spectral resolution. Typical sensors (imaging spectrometers) generate
in excess of two hundred spectral channels. Figure 11.1 shows part of a so-called image
cube for the AVIRIS (Airborne Visible/Infrared Imaging Spectrometer) sensor taken over
a region of the Californian coast. Sensors of this kind produce much more complex data
and provide correspondingly much more information about the reflecting surfaces examined.
Figure 11.2 displays the spectrum of a single pixel in the image.
117
118
11.1
Mixture modelling
In working with multispectral images, the fact that at the scale of observation a pixel contains
a mixture of materials is generally treated as a second order effect and more or less ignored.
With the availability of high spectral resolution sensors it has become possible to treat the
problem of the mixed pixel quantitatively.
The basic premise of mixture modelling is that within a given scene, the surface is
dominated by a small number of common materials that have relatively constant spectral
properties. These are referred to as the end-members. It is assumed that the spectral
variability captured by the remote sensing system can be modelled by mixtures of these
components.
11.1.1
Suppose that there are p end-members and ` spectral bands. Then we can denote the
spectrum of the ith end-member by the vector
i
m1
..
i
m = . .
mi`
Now define the matrix of end-members M according
1
m1
..
1
p
M = (m . . . m ) = .
m1`
to
s
..
.
mp1
.. ,
.
mp`
with one column for each end-member. For hyperspectral imagery we always have p `.
119
2(
i 1)
n
i=1
= (g
M)> 1
n (g
p
X
M) 2(
i 1)
i=1
L
=0
1p = 1,
(11.1)
where 1p = (1, 1 . . . 1)> . The first equation determines the mixing coefficients in terms of
known quantities and . The second equation can be used to eliminate .
11.1.2
If we work with MNF-projected data (see next section) then we can assume that n = 2 I.
If furthermore we ignore the constraint on (i.e. = 0), then (11.1) reduces to
= [(M> M)1 M> ]g.
The expression in square brackets is the pseudoinverse of the matrix M, see Chapter 1.
11.1.3
If a spectral library for all of the p end-members in M is available, the mixture coefficients
can be calculated directly. The primary result of the spectral mixture analysis is the fraction
120
images which show the spatial distribution and abundance of the end-member components
in the scene.
If such external data are unavailable, there are various strategies for determining endmembers from the hyperspectral imagery itself. We describe briefly the method recommended in ENVI and implemented in the so-called Spectral Hourglass Wizard.
The first step is to reduce the dimensionality of the data. This is done with the MNF
transformation described in Chapter 3. By examining the eigenvalues of the transformation
and retaining only the components with eigenvalues exceeding one (non-noise components),
the number of dimensions can be reduced substantially, see Figure 11.3.
Figure 11.3: Eigenvalues of the MNF transformation of the image in Figure 11.1.
The so-called pixel purity index (PPI) is then used to find the most spectrally pure, or
extreme, pixels in the remaining data. The most spectrally pure pixels typically correspond
to mixing end-members. The PPI is computed by repeatedly projecting n-dimensional
scatter plots onto a random unit vector. The extreme pixels in each projection are noted
and the number of times each pixel is marked as extreme is recorded. The purest pixels must
must be on the corners, edges or faces of the data cloud. A threshold value is used to define
how many pixels are marked as extreme at the ends of the projected vector. This value
should be 2-3 times the noise level in the data, which is 1 when using the MNF transformed
channels. A minimum of about 5000 iterations is usually required to produce useful results.
When the iterations are completed, a PPI image is created in which the value of each
pixel corresponds to the number of times that pixel was recorded as extreme. So bright
pixels are generally end-members. This image hints at locations and sites that could be
visited for ground truth measurements.
The n-dimensional visualizer, Figure 11.4 can then be used interactively to define classes
of pixels corresponding to end-members and to plot their spectra. These can be saved along
with their pixel locations as ROIs (regions of interest) for later use in spectral unmixing.
This method is repeatable and has the advantage of objectivity in analysis of a data
set to assess dimensionality and define end-members. The primary disadvantage is that it
is a statistical approach dependent upon the specific spectral variance of the image. Thus
the resulting end-members are mathematical constructs which may not be physically interpretable.
121
11.2
Orthogonal subspace projection is a transformation which is closely related to linear unmixing. Suppose that a multispectral image pixel g consists of a mixture of desirable and
undesirable spectra,
g = D + U + n.
The ` ` matrix
(11.2)
An example of the use of this transformation is the suppression of cloud cover from a
multispectral image. First an unsupervised classification is carried out (see Chapter 9) and
the clusters containing the undesired features (clouds) are identified. The mean vectors of
these clusters can then be used as the undesired spectra and combined to form the matrix
U. The the projection (11.2) can be applied to the entire image.
Here is an ENVI/IDL program to implement this idea:
; Orthogonal subspace projection
pro osp, event
print, ---------------------------------
print, Orthogonal Subspace Projection
print, systime(0)
122
print, ---------------------------------
infile=dialog_pickfile(filter=*.dat,/read) ; read in cluster centers
openr,lun,infile,/get_lun
; number of spectral channels
readf,lun,num_channels
readf,lun,K
; number of cluster centers
Ms=fltarr(num_channels,K)
readf,lun,Ms
Us=transpose(Ms)
print,Cluster centers (in the columns)
print,Us
centers=indgen(K)
print,enter undesired centers as 1 (e.g. 0 1 1 0 0 ...)
read,centers
U = Us[where(centers),*]
print,Subspace U
print,U
Identity = fltarr(num_channels,num_channels)
for i=0,num_channels-1 do Identity[i,i]=1.0
P = Identity - U##invert(transpose(U)##U,/double)##transpose(U)
print,projection matrix:
print, P
envi_select, title=Choose multispectral image for projection, $
fid=fid, dims=dims,pos=pos
if (fid eq -1) then goto, done
num_cols = dims[2]+1
num_lines = dims[4]+1
num_pixels = (num_cols*num_lines)
if (num_channels ne n_elements(pos)) then begin
print,image dimensions are incorrect, aborting ...
goto, done
end
image=fltarr(num_pixels,num_channels)
for i=0,num_channels-1 do $
image[*,i]=envi_get_data(fid=fid,dims=dims,pos=pos[i])+0.0
print,projecting ...
; do the projection
image = P ## image
out_array = bytarr(num_cols,num_lines,num_channels)
for i = 0,num_channels-1 do out_array[*,*,i] = $
bytscl(reform(image[*,i],num_cols,num_lines,/overwrite))
base = widget_auto_base(title=OSP Output)
123
124
Appendix A
..
.
(A.1)
ym =
n
X
aj (xj )m + .
j=1
(A.2)
"
#2
Pn
m
X
yi j=1 Aij aj
i=1
125
126
y i
n
X
k = 1 . . . n.
Aij aj Aik = 0,
k = 1 . . . n,
j=1
(A.3)
Eq. (A.3) is referred to as the normal equation. The fitted parameters of the model are thus
estimated by
= (A> A)1 A> y =: Ly.
a
(A.4)
The matrix
= h(L)(L)> i
= Lh> iL>
(A.5)
= 2 LL>
= 2 (A> A)1 .
To check that this is indeed a generalization of the simple linear regression, identify the
parameter vector a with the straight line parameters a and b, i.e.
a1
a
a=
=
.
a2
b
The matrix A and vector y are similarly
1
1
A=
...
1
x1
x2
,
..
.
xm
y1
y2
y=
.. .
.
ym
(A A)
=
Pm
xi
127
1
P 1
m P
m
x
P x2i
=
.
xi
m
x
x2i
>
A y=
y
Pm
x i yi
m
x
m
.
.
P
X
m
xy + xi yi
1
2
P
(m
x
+
m
x
y
)
=
.
i i
x2i + m2 x
2
m x2i + m2 x
2
(A.6)
From (A.3) the uncertainty in b is given by 2 times the (2,2) element of (A> A)1 ,
b2 = 2
m
.
+ m2 x
2
x2i
(A.7)
Equations (A.6) and (A.7) correspond to those for ordinary least squares.
A.2
Suppose that the measurement data in (A.1) are presented sequentially and we wish to
determine the best solution for the parameters a as the new data become available. We can
write Eq. (A.2) in the form
(A.8)
y ` = A` a +
indicating that ` measurements have been made up till now (we assume ` > n), where as
before n is the number of parameters (the length of a). The least squares solution is, with
(A.4),
1 >
= (A>
a
A` y` =: a(`)
` A` )
and, from (A.5), the covariance matrix of a(`) is
1
` = (A>
.
` A` )
(A.9)
(A.10)
Suppose a new observation becomes available. (Well call it (x(` + 1), y(` + 1)) rather
than (x`+1 , y`+1 ), as this simplifies the notation considerably.) Now we must solve the least
squares problem
A`
y`
=
a + ,
y(` + 1)
A(` + 1)
where A(` + 1) = x(` + 1)> . According to (A.10) the solution is
>
A`
y`
a(` + 1) = `+1
.
A(` + 1)
y(` + 1)
(A.11)
128
From (A.9) we can obtain a recursive formula for the covariance matrix `+1 :
>
A`
A`
1
>
`+1 =
= A>
` A` + A`+1 A`+1
A(` + 1)
A(` + 1)
or
1
>
1
`+1 = ` + A(` + 1) A(` + 1).
(A.12)
This simplifies to
a(` + 1) = a(`) + `+1 A(` + 1)> y(` + 1) A(` + 1)a(`) .
Finally, with the definition of the Kalman gain
K`+1 := `+1 A(` + 1)> ,
(A.13)
(A.14)
Equations (A.12A.14) define a so-called Kalman filter for the least squares problem
(A.8). For input x(` + 1) = A(` + 1) the system response A(` + 1)a(`) is calculated in
(A.14) and compared with the measurement y(` + 1). Then the innovation, that is to say
the difference between the measurement and system response, is multiplied by the Kalman
gain determined by (A.13) and (A.12) and the old value a(`) is corrected accordingly.
Relation (A.12) is inconvenient as it calculates the inverse of the covariance matrix `+1
whereas we require the non-inverted form in order to determine the Kalman gain (A.13).
Fortunately (A.12) and (A.13) can be reformed as follows:
`+1 = I K`+1 A(` + 1) `
1
K`+1 = ` A(` + 1)> A(` + 1)` A(` + 1)> + 1 .
(A.15)
129
To see this, first of all note that the second equation in (A.15) is a consequence of the
first equation and (A.13). Therefore it suffices to show that the first equation is indeed the
inverse of (A.12):
1
`+1 1
`+1 = I K`+1 A(` + 1) ` `+1
= I K`+1 A(` + 1) + I K`+1 A(` + 1) ` A(` + 1)> A(` + 1)
= I K`+1 A(` + 1) + ` A(` + 1)> A(` + 1) K`+1 A(` + 1)` A(` + 1)> A(` + 1).
The second equality above follows from (A.12). But from the second equation in (A.15) we
have
K`+1 A(` + 1)` A(` + 1)> = ` A(` + 1)> K`+1
and therefore
>
>
`+1 1
`+1 = I K`+1 A(` + 1) + ` A(` + 1) A(` + 1) (` A(` + 1) K`+1 )A(` + 1) = I
as required.
A.3
Orthogonal regression
In the model for ordinary least squares regression the xs are assumed to be error-free. In
the calibration case where it is arbitrary what we call the reference variable and what we
call the uncalibrated variable to be normalized, we should allow for error in both x and y.
If we impose the model1
yi i = a + b(xi i ), i = 1 . . . m
(A.16)
with and as uncorrelated, white, Gaussian noise terms with mean zero and equal variances
2 , we get for the estimator of b, [KS79],
q
(s2yy s2xx ) + (s2yy s2xx )2 + 4s2xy
b =
(A.17)
2sxy
with
1 X
(yi y)2
m i=1
n
s2yy =
(A.18)
and the remaining quantities defined in the section immediately above. The estimator for a
is
a
= y b
x.
(A.19)
According to [Pat77, Bil89] we get for the dispersion matrix of the vector (
a, b)>
2b(1 + b2 ) x
x(1 + )
2 (1 + ) + sxy /b
x(1 + )
1 +
msxy
with
=
1 The
2b
(1 + b2 )sxy
(A.20)
(A.21)
model in equation (A.16) is often referred to as a linear functional relationship in the literature.
130
2 =
m
(n 2)(1 + b2 )
(A.22)
see [KS79].
It can be shown that estimators of a and b can be calculated by means of the elements
in the eigenvector corresponding to the smallest eigenvalue of the dispersion matrix of the
m by 2 data matrix with a vector of the xs in the first column and a vector of the ys in
the second column, [KS79]. This can be used to perform orthogonal regression in higher
dimensions, i.e., when we have, for example, more x variables than the one variable we have
here.
Appendix B
B.1
Let f and g be two functions of the real numbers IR and define their inner product as
Z
hf, gi =
f (t)g(t)dt.
The inner product space L2 (IR) is the collection of all functions f : IR IR such that
Z
kf k = hf, f i
1/2
1/2
f (t) dt
< .
B.2
1/2
(f (t) g(t)) dt
2
Haar wavelets
Let Vn be the collection of all piecewise constant functions of finite extent1 that have possible
discontinuities at the rational points m 2n , where m and n are integers, m, n Z. Then
all members of Vn belong to the inner product space L2 (IR),
Vn L2 (IR).
Define the the Haar scaling function according to
n
1 if 0 t 1
.
(t) =
0 otherwise
131
(B.1)
132
1
k,k0 .
2n
h1,0 , 0,0 i
0,0 (t) + r(t).
h0,0 , 0,0 i
(B.2)
133
Vn = Vn1 Vn1
= V0 V0 . . . Vn2
Vn1
.
134
and
0,0
1
1
0
1
1
0
=
, 1,0 =
, 1,1 =
.
1
0
1
1
0
1
Thus the orthogonal basis B2 can be represented by the mutually orthogonal vectors
1
1
0
1
1 1 1 0
B2 = ,
.
,
,
1
0
1
1
1
0
1
1
Example: signal compression
We consider the continuous function f (t) = sin(20t)(log t)2 sampled at 64 evenly spaced
points on the interval [0, 1]. The 64 samples comprise a signal vector
f = (f0 , f1 . . . f63 )> = (f (0/63), f (1/63) . . . f (63/63))>
and can also be thought of as a piecewise constant function f(t) belonging to the function
space V6 . The function is shown in Figure B.3.
Figure B.3: The function sin(20t)(log x)2 sampled at 64 points on [0, 1].
We can express the function f(t) in the basis C6 as follows:
(B.3)
1 1
1 1
1 1
1 1
B3 =
1 1
1 1
1 1
1 1
135
matrix of ones and zeroes. This is too large to show
1
1
1
1
0
0
0
0
1
0
0
0 1 0
0
0
1
0
0 1
1
0
0
1
0
0
1 0
0
1 0
0
0
0
0
0
0
0
0
0
,
1
0
1 0
0
1
0 1
for example. The elements of the vector w comprise the wavelet coefficients. They are given
by the wavelet transform
w = B1
6 f.
The wavelet coefficients are thus an alternative way of representing the original signal f(t).
They are plotted in Figure B.4
Figure B.4: The wavelet coefficients w for the signal in Figure B.3.
Notice that many of the coefficients are close to zero. We can define a threshold below
which all coefficients are set exactly to zero. This generally leads to long series of zeroes in
w, so that it can be compressed efficiently,
w wcompr .
Figure B.5 shows the result of reconstructing the signal according to
f = B6 wcompr
after setting a threshold of 0.1. In all, 33 of the 64 wavelet coefficients are zero after
thresholding.
136
B.3
137
Multi-resolution analysis
So far we have considered only functions on the interval [0, 1] with basis functions n,k (t) =
(2n t k), k = 1 . . . 2n 1. We can extend this to functions defined on all real numbers IR
in a straightforward way. For example
{(t k) | k Z}
is a basis for the space V0 of all piecewise constant functions with compact support (finite
extent) having possible breaks at integer values. More generally, a basis for the set Vn of
piecewise constant functions with possible breaks at m 2n and compact support is
{(2n t k) | k Z}.
We can even allow n < 0. For example n = 1 means that the possible breaks are at even
integer values.
We can think of the collection of nested subspaces of piecewise constant functions
. . . V1 V0 V1 V2 . . . L2 (IR),
as being generated by the Haar scaling function . This collection is called a multiresolution
analysis (MRA). A general MRA must have the following properties:
S
1. V = nZ Vn is dense in L2 (IR), that is, for any function f L2 (IR) there exists a
series of functions, one in each Vn , which converges to f . This is true of the Haar
MRA, see Figure 2.7 for example.
T
2. The separation property: I = nZ Vn = {0}. For the Haar MRA, this means that
any function in I must be piecewise constant on all intervals. The only function in
L2 (IR) with this property and compact support is f (t) = 0, so the separation property
is satisfied.
3. The function f (t) Vn if and only if f (2n t) V0 . In the Haar MRA, if f (t) V1
then it is piecewise constant on intervals of length 1/2. Therefore the function f (21 t)
is piecewise constant on intervals of length 1, that is f (21 t) V0 , etc.
4. The scaling function is an orthonormal basis for the function space V0 , i.e. h(t
k), (t k 0 )i = kk0 . This is of course the case for the Haar scaling function.
In the following, we will think of (t) as any scaling function which generates an MRA
in the above sense. Since {(t k) | k Z} is an orthonormal basis for V0 , it follows that
{(2t k) | k Z} is an orthogonal basis for V1 . That is, let f (t) V1 . Then by property
3, f (t/2) V0 and
X
X
f (t/2) =
ak (t k) f (t) =
ak (2t k).
k
X
k
ck (2t k).
(B.4)
138
The constants ck are called the refinement coefficients. For example, the dilation equation
for the Haar wavelets is
(t) = (2t) + (2t 1)
so that the refinement coefficients are c0 = c1 = 1, ck = 0 otherwise.
Note that c20 + c21 = 2. It is easy to show that this is a general property of the refinement
coefficients:
X
X
1X 2
ck (2t k),
ck0 (2t k 0 )i =
ck .
1 = h(t), (t)i = h
2
0
k
Therefore,
c2k = 2,
(B.5)
k=
which is also called Parsevals formula. In a similar way it is easy to show that
ck ck2j = 0, j 6= 0.
(B.6)
k=
B.4
There are many other possible scaling functions that define or generate a MRA. Some of
these cannot be expressed as simple, analytical functions. But once we have the refinement
coefficients for a scaling function, we can approximate that scaling function to any desired
degree of accuracy using the dilation equation. (In fact we can work with a MRA even
when there is no simple analytical representation for the scaling function which generates
it.) The idea is to iterate the refinement equation with a so-called fixpoint algorithm until
it converges to a sequence of points which approximates (t).
Let F be the function that assigns the expression
X
cn (2t n)
F ()(t) =
n
to any function (t), where cn are refinement coefficients. Applying F to the Haar scaling
function:
X
F ()(t) =
cn (2t n) = (t)
n
where the second equality follows from the dilation equation. Thus is a fixpoint of F .
The following recursive scheme can be used to estimate a scaling function with up to
four refinement coefficients:
f0 (t) = t,0
fi (t) = c0 fi1 (2t) + c1 fi1 (2t 1) + c2 fi1 (2t 2) + c3 fi1 (2t 3).
In this scheme, t takes on values of the form m 2n , m, n Z, only. The first definition is
the termination condition for the recursion and approximates the scaling function to zeroth
order as the Dirac delta function. The second relation defines the ith approximation to the
scaling function in terms of the (i 1)th approximation using the dilation equation. We can
calculate the set
j
n
j
=
0
.
.
.
3(2
)
, n 1,
fn
2n
139
Figure B.6: The fixpoint approximation of the Haar scaling function to order n = 4.
Figure B.6 shows the result of n=4 iterations using the refinement coefficients c0 = c1 =
1, c2 = c3 = 0 for the Haar scaling function.
140
B.5
Let f be a signal or function, f L2 (IR), and let Pn (f ) denote its projection onto the space
Vn . We saw in the case of the Haar MRA that we can always write
X hf, n,k i
n,k .
Pn+1 (f ) = Pn (f ) +
hn,k , n,k i
k
(B.7)
where is the scaling function. It can in fact be shown that this is always the case for any
MRA, except that the last expression relating the mother wavelet to the scaling function
is generalized.
Consider now some MRA with a normalized scaling function defined (in the sense of
the preceding section) by the dilation equation (B.4). Since
1
1
h(t), (t)i = ,
2
2
where
ck
hk = .
2
h2k = 1.
(B.9)
Now we assume, in analogy to (B.8), that can be expressed in terms of the scaling function
as
X
gk 2(2t k).
(B.10)
(t) =
k
hk gk = 0.
(B.11)
Similarly,
h(t k), (t m)i =
gi gi2(km) = k,m .
(B.12)
(t) =
(1)k h1k 2(2t k) =
(1)k c1k (2t k).
(B.13)
k
B.6
141
The Daubechies scaling function is derived according to the following two requirements on
an MRA:
1. Compact support: The scaling function (t) is required to be zero outside the interval
0 < t < 3. This means that the refinement coefficients ck vanish for k < 0, k > 3. To see
this, note that
Z
3
(t)(2t + 3)dt = 0
0
and similarly for k = 4, 5 . . . and for k = 6, 7 . . .. Therefore, from the dilation equation,
(1/2) = 0 = c2 (1 + 2) + c1 (1 + 1) + . . . c2 = 0
and similarly for k = 1, 4, 5.
Thus from (B.5), we can conclude that
c20 + c21 + c22 + c23 = 2
(B.14)
c0 c2 + c1 c3 = 0.
(B.15)
k=0
k=0
R
But one can show that an MRA implies (t)dt 6= 0 so we have
c0 + c1 + c2 + c3 = 2.
(B.16)
2. Regularity: All constant and linear polynomials can be written as a linear combination of
the basis {(t k) | k Z} for V0 . This implies that there is no residual in the orthogonal
decomposition of f (t) = 1 and f (t) = t onto the basis, that is,
Z
Z
(t)dt =
t(t)dt = 0.
(B.17)
(B.18)
k=0
3
X
t(2t 1 + k)dt
u+1k
(u)du
4
k=0
Z
Z
0
c0 + c2 2c3
=
u(u)du +
(u)du,
4
4
(1)k+1 ck
(B.19)
142
(B.20)
Equations (B.14), (B.15), (B.16), (B.19) and (B.20) comprise a system of five equations in
four unknowns. A solution is given by
1+ 3
3+ 3
3 3
1 3
c0 =
, c1 =
, c2 =
, c3 =
,
4
4
4
4
which are known as the D4 refinement coefficients. Figure B.7 shows the corresponding
scaling function, determined with the fixpoint method described earlier.
Figure B.7: The fixpoint approximation of the Daubechies scaling function to order n = 4.
143
wset, 0
tv, bytscl(image)
print, Size of original image is, 512*512L, bytes
; perform wavelet transform with D4 wavlet
wtn_image = wtn(image, 4)
; convert to sparse array with threshold 20 and write to disk
sparse_image = sprsin(wtn_image,thresh=20)
write_spr, sparse_image, sparse.dat
openr, 1, sparse.dat
status = fstat(1)
close, 1
print, Size of compressed image is, status.size, bytes
; reconstruct full array, do inverse wavelet transform and display
wset,1
tv, bytscl(wtn(fulstr(sparse_image), 4, /inverse))
end
B.7
In the case of the Haar wavelets we were able to carry out the wavelet transformation with
vectors and matrices. In general, we cant represent scaling functions in this way. In fact
usually all that we have to work with are the refinement coefficients. So how can we perform
the wavelet transformation? To answer this question, consider a row of pixels
(s(0), s(1) . . . s(m 1))
in a satellite image, where m = 2n , and the associated vector signal on [0, 1] given by
s = (s0 , s1 . . . sm1 )> = (s(0/(m 1)), s(1/(m 1)) . . . s(1))> .
In the MRA generated by a scaling function , such as D4 , this signal defines a function
fn (t) Vn on the interval [0, 1] according to
fn (t) =
m1
X
j=0
sj n,j =
m1
X
sj (2n t j).
(B.21)
j=0
Assume that the basis functions are appropriately normalized. The projection of fn (t) onto
Vn1 is then
X
m/21
fn1 (t) =
k=0
m/21
k=0
where
>
Hs = hfn , (2n1 t)i, hfn , (2n1 t 1)i . . . hfn , (2n1 t m/2 1)i
144
is the signal vector in Vn1 . The operator H is interpreted as a low-pass filter. It averages
the original signal s and reduces its length by a factor of two. We have, using (B.21),
(Hs)k =
m1
X
j=0
so we can write
(Hs)k =
m1
X
sj
j=1
Therefore
(Hs)k =
m1
X
k0
j=1
m1
X
sj
k0
hj2k sj ,
k = 0...
j=0
m
1 = 2n1 1.
2
(B.22)
1+ 3
3+ 3
3 3
1 3
, h1 =
, h2 =
, h3 =
, h4 = 0, . . . .
h0 =
4 2
4 2
4 2
4 2
Thus the elements of the filtered signal are
(Hs)0 = h0 s0 + h1 s1 + h2 s2 + h3 s3
(Hs)1 = h0 s2 + h1 s3 + h2 s4 + h3 s5
(Hs)3 = h0 s4 + h1 s5 + h2 s6 + h3 s7
..
.
This is just the convolution of the filter H = (h3 , h2 , h1 , h0 ) with the signal s,
Hs = H s,
see Eq. (2.12), except that only every second term is retained. This is referred to as
downsampling and is illustrated in Figure B.8.
In the same way, the high-pass filter G projects fn (t) onto the orthogonal subspace Vn1
according to
m1
X
m
gj2k sj , k = 0 . . .
1 = 2n1 1.
(B.23)
(Gs)k =
2
j=0
Recall that
gk = (1)k h1k
145
2
Hs
Figure B.8: Schematic representation of the filter H. The symbol 2 indicates downsampling
by a factor of two.
s1
d1
m/21
(H s1 )k =
hk2j s1j ,
k = 0 . . . m 1 = 2n 1,
(B.24)
gk2j d1j ,
k = 0 . . . m 1 = 2n 1,
(B.25)
j=0
m/21
1
(G d )k =
j=0
with analagous definitions for the other stages. To understand whats happening, consider
146
s1
H s1
Figure B.10: Schematic representation of the filter H . The symbol 2 indicates upsampling
by a factor of two.
Equation (B.25) is interpreted in a similar way. Finally we add the two results to get
the original signal:
H s1 + G d1 = s.
To see this, write the equation out for a particular value of k:
X
m1
X
m/21
(H s1 )k + (G d1 )k =
hk2j
hj 0 2j sj 0 + gk2j
j 0 =0
j=0
m1
X
gj 0 2j sj 0
j 0 =0
m1
X
m/21
sj 0
j 0 =0
[hk2j hj 0 2j + gk2j gj 0 2j ].
j=0
m1
X
j 0 =0
m/21
sj 0
j=0
147
With the help of (B.5) and (B.6) it is easy to show that the second summation above is just
j 0 k . For example, suppose k is even. Then
X
m/21
j=0
0
from (B.5) and hk = ck / 2. For any other value of j 0 , the expression is zero. Therefore we
can write
m1
X
(H s1 )k + (G d1 )k =
sj 0 j 0 k = sk ,
j 0 =0
as claimed. The reconstruction of the original signal from s1 and d1 is shown in Figure B.11
as a synthesis bank.
s1
d1
148
fg1 = fltarr(256,256)
gf1 = fltarr(256,256)
gg1 = fltarr(256,256)
; read a bitmap image and cut out a 512x512 pixel array
filename = Dialog_Pickfile(Filter=*.bmp,/Read)
image = Read_BMP(filename)
; 24 bit image, so get first layer
f0[*,*] = image[1,0:511,0:511]
; display cutout
window,0,xsize=512,ysize=512
wset, 0
tv, bytscl(f0)
; filter columns and downsample
ds = findgen(256)*2
for i=0,511 do begin
temp = convol(transpose(f0[i,*]),H,center=0,/edge_wrap)
f1[i,*] = temp[ds]
temp = convol(transpose(f0[i,*]),G,center=0,/edge_wrap)
149
150
Appendix C
C.1
2 E(w)
.
wi wj
(C.1)
It is the (symmetric) matrix of second order partial derivatives of the cost function E(w)
with respect to the synaptic weights, the latter thought of as a single column vector
h
w1
..
.
h
w
Lo
w=
w1
.
..
o
wM
151
152
v> Hv = v>
i i ui =
i2 i ,
and we conclude that H is positive definite if and only of all of its eigenvalues i are positive.
Thus a good way to check if one is at or near a local minimum in the cost function is to
examine the eigenvalues of the Hessian.
The scaled conjugate gradient algorithm makes explicit use of the Hessian matrix for
more efficient convergence to a minimum in the cost function. The disadvantage of using H
is that it is difficult to compute efficiently. For example, for a typical classification problem
with N = 3-dimensional input data, L = 8 hidden neurons and M = 12 land use categories,
there are
[L(N + 1) + M (L + 1)]2 = 19, 600
matrix elements to determine at each iteration. We develop in the following an efficient
method to calculate not H directly, but rather the product v> H for any vector v having
nw components. Our approach follows Bishop [Bis95] closely.
C.1.1
The R-operator
Let us begin by summarizing some results of Chapter 10 for the two-layer, feed forward
network:
x0> = (x1 . . . xN ) input observation
y> = (0 . . . 1 . . . 0)
>
0>
x = (1, x )
h
I =W
0
h>
n = g (I )
I =W
o>
class label
(C.2)
m = g (I )
1
h
j = 1 . . . L,
(C.3)
k = 1 . . . M.
(C.4)
1 + eIj
(Iko )
eIk
= PM
k0 =1
eIk0
The first derivatives of the local cost function with respect to the output and hidden weights,
(10.19) and (10.22), can be written concisely as
E
= n o>
Wo
E
= x h> ,
Wh
(C.5)
o = y m
(C.6)
where
153
0
h
= n (1 n) Wo o .
(C.7)
,
w
Obviously we have
Rv {w} =
vj
w
= v.
wj
We adopt the convention that the result of applying the R-operator has the same structure
as the argument to which it is applied. Thus for example
Rv {Wh } = Vh ,
where Vh , like Wh , is an (N + 1) L matrix consisting of the first (N + 1) L components
of the vector v.
Next we derive an expression for v> H in terms of the R-operator.
(v> H)j =
nw
X
vi Hij =
i=1
(v H)j = v
w
>
E
wj
w
X
2E
=
vi
wi wj
w
i
i=1
vi
i=1
or
>
nw
X
= Rv
E
wj
E
wj
,
j = 1 . . . nw .
E
w>
Rv
E
Wh
, Rv
E
Wo
.
(C.8)
Note the reorganization of the structure in the argument of Rv , namely w> (Wh , Wo ).
This is merely for convenience. Once the expressions on the right have been evaluated, the
result must be flattened back to a row vector. Equation (C.1.1) is understood to involve
the local cost function. In order to complete the calculation we must sum over all training
pairs.
Applying the chain rule to (C.5),
E
= nRv { o> } RV {n} o>
Wo
E
Rv
= xRv { h> },
Wh
Rv
(C.9)
154
Determination of Rv {n}
From (C.2) we can write
Rv {n} =
0
Rv {n0 }
(C.10)
(C.11)
Rv {Ih } = Vh> x.
(C.12)
and
is interpreted as an L (N + 1)-dimensional
Note that, according to our convention, V
matrix, since the argument Ih is a vector of length L.
h>
Determination of Rv { o }
With (C.6) and (C.2) we get
Rv { o } = Rv {m} = v>
m
= g o 0 (Io ) Rv {Io },
w
(C.13)
(C.14)
W
Vo o
+
h
Rv { }
g h (Ih )
Rv {Ih }
g h (Ih )
0
0
+
Wo Rv { o }.
g h (Ih )
Now we use the derivatives of the activation function
0
g h (Ih ) = n0 (1 n0 )
00
C.1.2
155
To calculate the Hessian matrix for the neural network, we evaluate (C.1.1) successively for
the vectors
v1> = (1, 0, 0 . . . 0) . . . vn>w = (0, 0, 0 . . . 1)
and build up H row for row:
v1> H
H = ... .
vn>w H
The following excerpt from the IDL program FFNCG DEFINE (see Appendix D) implements a vectorized version of the preceding calculation of v> H and H:
Function FFNCG::Rop, V
nw = self.LL*(self.NN+1)+self.MM*(self.LL+1)
; reform V to dimensions of Wh and Wo and transpose
VhT = transpose(reform(V[0:self.LL*(self.NN+1)-1],self.LL,self.NN+1))
Vo = reform(V[self.LL*(self.NN+1):*],self.MM,self.LL+1)
VoT = transpose(Vo)
; transpose the weights
WhT = transpose(*self.Wh)
Wo = *self.Wo
WoT = transpose(Wo)
; vectorized forward pass
X = *self.Xs
Zeroes = fltarr(self.p)
Ones = Zeroes + 1.0
N = [[Ones],[1/(1+exp(-WhT##X))]]
Io = WoT##N
maxIo = max(Io,dimension=2)
for k=0,self.MM-1 do Io[*,k]=Io[*,k]-maxIo
A = exp(Io)
sum = total(A,2)
M = fltarr(self.p,self.MM)
for k=0,self.MM-1 do M[*,k] = A[*,k]/sum
; evaluation of v^T.H
D_o = *self.Ys-M
; d^o
RIh = VhT##X
; Rv{I^h}
RN = N*(1-N)*[[Zeroes],[RIh]]
; Rv{n}
RIo = WoT##RN + VoT##N
; Rv{I^o}
Rd_o = -M*(1-M)*RIo
; Rv{d^o}
Rd_h = N*(1-N)*((1-2*N)*[[Zeroes],[RIh]]*(Wo##D_o) + Vo##D_o + Wo##Rd_o)
Rd_h = Rd_h[*,1:*]
; Rv{d^h}
REo = -N##transpose(Rd_o)-RN##transpose(D_o) ; Rv{dE/dWo}
REh = -X##transpose(Rd_h)
; Rv{dE/dWh}
return, [REh[*],REo[*]]
; v^T.H
End
156
Function FFNCG::Hessian
nw = self.LL*(self.NN+1)+self.MM*(self.LL+1)
v = diag_matrix(fltarr(nw)+1.0)
H = fltarr(nw,nw)
for i=0,nw-1 do H[*,i] = self -> Rop(v[*,i])
return, H
End
C.2
The backpropagation algorithm of Chapter 10 attempts to minimize the cost function locally,
that is, weight updates are made immediately after presentation of a single training pair to
the network. We will now consider a global approach aimed at minimization of the full cost
function (10.15), which we denote in the following E(w). The symbol w is, as before, the
nw -component vector of synaptic weights.
Now let the gradient of the cost function at the point w be g(w), i.e.
g(w)
E(w),
wi
2 E(w)
wi wj
i = 1 . . . nw .
i, j = 1 . . . nw
C.2.1
g(w)> .
w
(C.16)
Conjugate directions
The search for a minimum in the cost function can be visualized as a series of points in the
space of synaptic weight parameters,
w1 , w2 . . . wk1 , wk , wk+1 . . . ,
whereby the point wk is determined by minimizing E(w) along some search direction dk1
which originated at the preceding point wk1 . This is illustrated in Figure C.1 and corresponds to the vector equation
wk = wk1 + k1 dk1 .
(C.17)
Here dk1 is a unit vector along the chosen search direction and the scalar k1 minimizes
the cost function along that direction:
k1 = arg min E wk1 + dk1 .
(C.18)
If, starting from wk , we now wish to take the next minimizing step in the weight space,
it is not efficient simply to choose, as in backpropagation, the direction of the local gradient
g(wk ) at the new starting point wk . It follows namely from (C.18) that
E wk1 + dk1 =
=0
k1
wk1
k1
dk1
157
*
dk ?
R
wk
g(wk )
?
or
>
(C.19)
The gradient g(wk ) at the new point wk is thus always orthogonal to the preceding search
direction dk1 . This is indicated in Figure C.1. Since the algorithms has just succeeded
in reducing the gradient of the cost function along dk1 to zero, we would prefer to choose
the search direction dk so that the component of the gradient along the old search direction
remains as small as possible. Otherwise we are undoing what we have just accomplished.
Therefore we choose dk according to the condition
g wk + dk
>
dk1 = 0.
>
= g(wk )> + dk
>
>
g(wk )> = g(wk )> + dk H
w
dk Hdk1 = 0.
(C.20)
C.2.2
Of course the neural network cost function is not quadratic in the synaptic weights. However
within a sufficiently small region of weight space it can be approximated as a quadratic
function. We describe in the following an efficient procedure to find the global minimum of
a quadratic function of w having the general form
1
E(w) = E0 + b> w + w> Hw,
2
(C.21)
158
E(w) = b + Hw,
w
g(w) =
and at the global minimum w ,
b + Hw = 0.
(C.22)
dk Hd` = 0
for k 6= `, k, ` = 1 . . . nw .
(C.23)
The search directions dk are linearly independent. In order to demonstrate this let us assume
the contrary, that is, that there exists an index k and constants k0 , k 0 6= k, not all of which
are zero, such that
nw
X
0
dk =
k0 dk .
k0 =1
k0 6=k
0>
for k 0 6= k
Hdk = 0
for k 0 6= k.
The assumption thus leads to a contradiction and the dk are indeed linearly independent.
The conjugate directions thus constitute a (non-orthogonal) vector basis for the entire weight
space.
In the search for the global minimum suppose we begin at an arbitrary point w1 and
express the vector w w1 spanning the distance to the global minimum as a linear combination of the basis vectors dk :
nw
X
w w1 =
k dk .
(C.24)
k=1
Further, define
wk = w1 +
k1
X
` d`
(C.25)
`=1
k = 1 . . . nw .
(C.26)
At the kth step the search starts at the point wk and proceeds a distance k along the
conjugate direction dk . After nw such steps the global minimum w is reached, since from
(C.24C.26) it follows that
w = w1 +
nw
X
k=1
1 It
can be shown that such a set always exists, see e.g. [Bis95].
159
>
We get the necessary step sizes k from (C.24) by multiplying from the left with d` H,
>
>
d` Hw d` Hw1 =
nw
X
>
k d` Hdk .
k=1
>
d` (b + Hw1 ) = ` d` Hd` ,
and an explicit formula for the step sizes is given by
>
` =
d` (b + Hw1 )
>
d` Hd`
` = 1 . . . nw .
>
dk Hwk = dk Hw1 + 0,
and therefore, replacing index k by `,
>
>
d` Hw` = d` Hw1 .
The step lengths are thus
>
` =
d` (b + Hw` )
>
d` Hd`
` = 1 . . . nw .
k =
dk g k
>
dk Hdk
k = 1 . . . nw .
(C.27)
For want of a better alternative we can choose the first search direction along the negative
local gradient
E(w1 ).
d1 = g1 =
w
(Note that d1 is not a unit vector.) We move according to (C.27) a distance
>
1 =
d1 d1
d1 > Hd1
along this direction to the point w2 , at which the local gradient g2 is orthogonal to d1 . We
then choose the new conjugate search direction d2 as a linear combination of the two:
d2 = g2 + 1 d1
or, at the kth step,
dk+1 = gk+1 + k dk .
(C.28)
160
We get the coefficient k from (C.28) and (C.20) by multiplication on the left with dk H:
>
>
0 = dk Hgk+1 + k dk Hdk ,
from which follows
>
k =
gk+1 Hdk
>
dk Hdk
(C.29)
C.2.3
The algorithm
Returning now to the non-quadratic neural net cost function E(w) we will apply the above
method to minimize it. We must take two things into consideration.
First of all, the Hessian matrix H is neither constant nor everywhere positive definite.
We will denote its local value at the point wl as Hk . When Hk is not positive definite it
can happen that (C.27) leads to a step along the wrong direction the numerator might
turn out to be negative. Therefore we replace (C.27) with2
>
k =
dk g k
>
dk Hdk + k |dk |2
k = 1 . . . nw .
(C.30)
The constant k is supposed to ensure that the denominator in (C.30) is always positive. It
is initialized for k = 1 with a small numerical value. If, at the kth iteration, it is determined
that
>
k := dk Hdk + k (dk )2 < 0,
k given by
then k is replaced by the larger value
k
k = 2 k k 2 .
|d |
(C.31)
This ensures that the denominator in (C.30) becomes positive again. Note that this increase
in k has the effect of decreasing the step size k , as is apparent from (C.30).
Second, we must take into account any deviation of the cost function from its local
quadratic approximation. Such deviations are to be expected for large step sizes k . As a
measure of the quadricity of E(w) along the chosen step length we can use the ratio
k =
2 This
2 E(wk ) E(wk + k dk )
>
k dk gk
(C.32)
161
This quantity is precisely 1 for a strictly quadratic function like (C.21). Therefore we can
use the following heuristic: For the k + 1st iteration
if k > 3/4,
k+1 := k /2
if k < 1/4,
k+1 := 4k
else,
k+1 := k .
In other words, if the local quadratic approximation looks good according to criterion (C.32),
then the step size can be increased (k+1 is reduced relative to k ). If this is not the case
then the step size is decreased (k+1 is made larger).
All of which leads us finally to the following algorithm (see e.g. [Moe93])
Algorithm (Scaled Conjugate Gradient)
1. Initialize the synaptic weights w with random numbers, set k = 0, = 0.001 and
d = g = E(w)/w.
2. Set = d> Hd + |d|2 . If < 0, set = 2( /d2 ) and = d> Hd. Save the
current cost function E1 = E(w).
3. Determine the step size = d> g/ and new synaptic weights w = w + d.
4. Calculate the quadricity = (E1 E(w))/( d> g). If < 1/4, restore the old
weights: w = w d, set = 4, d = g and go to 2.
5. Set k = k + 1. If > 3/4 set = /2.
6. Determine the new local gradient g = E(w)/w and the new search direction d =
g + d, whereby, if k mod nw 6= 0 then = g> Hd/(d> Hd) else = 0.
7. If E(w) is small enough stop, else go to 2.
A few remarks on this algorithm:
The integer k counts the total number of iterations. Whenever k mod nw = 0 exactly
nw weight updates have been carried out and the minimum of a truly quadratic function would have been reached. This is taken as a good stage at which to restart the
search along the negative local gradient g rather than continuing along the current
conjugate direction d. One expects that approximation errors will gradually corrupt
the determination of the conjugate directions and the fresh start is intended to
counter this.
Whenever the quadricity condition is not filled, i.e. whenever < 1/4, the last weight
update is cancelled and the search again restarted along g.
Since the Hessian only occurs in the forms d> H, and g> H, it can be determined
efficiently with the R-operator method.
Here is an excerpt from the object FFNCG class extending FFN, showing the training
method which implements scaled conjugate gradient algorithm:
162
Pro FFNCG::Train
w = [(*self.Wh)[*],(*self.Wo)[*]]
nw = n_elements(w)
g = self->gradient()
d = -g
; search direction, row vector
k = 0L
lambda = 0.001
window,12,xsize=600,ysize=400,title=FFN(scaled conjugate gradient)
wset,12
progressbar = Obj_New(progressbar, Color=blue, Text=0,$
title=Training: epoch number...,xsize=250,ysize=20)
progressbar->start
eivminmax = ?
repeat begin
if progressbar->CheckCancel() then begin
print,Training interrupted
progressbar->Destroy
return
endif
d2 = total(d*d)
; d^2
dTHd = total(self->Rop(d)*d)
; d^T.H.d
delta = dTHd+lambda*d2
if delta lt 0 then begin
lambda = 2*(lambda-delta/d2)
delta = -dTHd
endif
E1 = self->cost()
; E(w)
(*self.cost_array)[k] = E1
dTg = total(d*g)
; d^T.g
alpha = -dTg/delta
dw = alpha*d
w = w+dw
*self.Wh = reform(w[0:self.LL*(self.NN+1)-1],self.LL,self.NN+1)
*self.Wo = reform(w[self.LL*(self.NN+1):*],self.MM,self.LL+1)
; E(w+dw)
E2 = self->cost()
Ddelta = -(E1-E2)/(alpha*dTg)
; quadricity
if Ddelta lt 0.25 then begin
w = w - dw
; undo change in the weights
*self.Wh = reform(w[0:self.LL*(self.NN+1)-1],self.LL,self.NN+1)
*self.Wo = reform(w[self.LL*(self.NN+1):*],self.MM,self.LL+1)
lambda = 4*lambda
; decrease step size
d = -g
; restart along gradient
end else begin
k++
if Ddelta gt 0.75 then lambda = lambda/2
g = self->gradient()
if k mod nw eq 0 then begin
beta = 0
eivs = self->eigenvalues()
eivminmax = string(min(eivs)/max(eivs),format=(F10.6))
163
C.3
In this Section we apply the recursive least squares method described in Appendix A to
train the feed forward neural network of Figure 10.4. The appropriate cost function is the
quadratic function (10.13) or, more specifically, its local version (10.14).
0 n(` + 1)
wko
1
1
.. n1 (` + 1) ~
m(` + 1)
q
.
: k
>
j
nj (` + 1)
..
.
L nL (` + 1)
ewk
mk (` + 1) = g(wko> n(` + 1)) = PM
k0 =1
n(`+1)
o> n(`+1
ewk0
k = 1 . . . M,
which is compared to the desired output y(` + 1). It is easy to show that differentiation with
respect to wko yields
(C.33)
(C.34)
164
C.3.1
Linearization
We shall drop for the time being the indices on wko , writing it simply as w. Let us call w(`)
an approximation to the desired synaptic weight vector for our isolated output neuron, one
which has been achieved so far in the training process, i.e. after presentation of the first `
training pairs. Then a linear approximation to mk (` + 1) can be obtained by expanding in
a first order Taylor series about the point w(`),
m(` + 1) g(w(`)> n(` + 1)) +
>
(C.35)
where m(`
+ 1) is given by
m(`
+ 1) = g(w(`)> n(` + 1)).
With the definition of the linearized input
A(` + 1) = m(`
+ 1)(1 m(`
+ 1))n(` + 1)>
(C.36)
while the recursive expression (A.14) for the parameter vector becomes
w(` + 1) = w(`) + K`+1 y(` + 1) A(` + 1)w(`) .
(C.37)
165
This can be improved somewhat by replacing the linear approximation to the system output
A(` + 1)w(`) by the actual output for the ` + 1st training observation, namely m(`
+ 1), so
we have
w(` + 1) = w(`) + K`+1 y(` + 1) m(`
+ 1) .
C.3.2
(C.38)
The algorithm
The recursive calculation of w is depicted in Figure C.3. The input is the current weight
vector w(`), its covariance matrix ` and the output vector of the hidden layer n(` + 1)
obtained by propagating the next input observation x(` + 1) through the network. After
determining the linearized input A(` + 1), Eq. (C.36), the Kalman gain K`+1 and the new
covariance matrix `+1 are calculated with (C.37). Finally, the weights are updated in
(C.38) to give w(` + 1) and the procedure is repeated.
n(` + 2)
n(` + 1)
? A(` + 1)K`+1
- C.37
C.36
>
6
y(` + 1)
? A(` + 2)
C.36
3
?
w(`
+ 1)
- C.38
w(`)
`+1
`
Figure C.3: Determination of the synaptic weights for an isolated neuron with the Kalman
filter.
To make our notation explicit for the output neurons, we substitute
y(`) yk (`)
w(`) wko (`)
m(`
+ 1) m
k (` + 1) = g wko> (`)n(` + 1)
>
A(` + 1) Aok (` + 1) = m
k (` + 1)(1 m
k (` + 1))n(` + 1)
K` Kok (`)
` ok (`),
for k = 1 . . . M . Then (C.38) becomes
wko (` + 1) = wko (`) + Kok (` + 1) y(` + 1) m
k (` + 1) ,
k = 1 . . . M.
(C.39)
Recalling that we wish to minimize the local quadratic cost function E(`) given by Eq.
(10.14), note that the expression in square brackets above is in fact the negative derivative
166
E(`)
mk (`)
so that
wko (`
+ 1) =
wko (`)
Kok (`
E(`)
+ 1)
mk (`)
.
(C.40)
m
k (`+1)
With this result, we can turn consideration to the hidden neurons, making the substitutions
w(`) wjh (`)
m(`
+ 1) n
j (` + 1) = g wjh> (`)x(` + 1)
>
A(` + 1) Ahj (` + 1) = n
j (` + 1)(1 n
j (` + 1))x(` + 1)
K` Khj (`)
` hj (`),
for j = 1 . . . L. Then, analogously to (C.40), the update equation for the weight vector of
the jth hidden neuron is
E(` + 1)
h
h
h
wj (` + 1) = wj (`) Kj (` + 1)
.
(C.41)
nj (` + 1) n j (`+1)
To obtain the partial derivative in (C.41), we differentiate the cost function (10.14)
X
mk ( + 1)
E(` + 1)
(yk (` + 1) mk (` + 1))
=
.
nj (` + 1)
nj ( + 1)
M
k=1
o
, we have
From (C.34), noting that (wko )j = Wjk
mk (` + 1)
o
= mk (` + 1)(1 mk (` + 1))Wjk
(` + 1)
nj (` + 1)
Combining the last two equations,
X
E(` + 1)
o
=
(yk (` + 1) mk (` + 1))mk (` + 1)(1 mk (` + 1))Wjk
(` + 1)
nj (` + 1)
M
k=1
(C.42)
o
where Wj
is the jth row of the output layer weight matrix, and where
(C.43)
ok (0) = ZIo ,
Z 1, j = 1 . . . L, k = 1 . . . M,
167
Ahj (` + 1) = n
j (` + 1)(1 n
j (` + 1))x(` + 1) ,
j = 1 . . . L,
m
k (` + 1) = g wko> (`)
n(` + 1)
>
Aok (` + 1) = m
k (` + 1)(1 m
k (` + 1))
n(` + 1) ,
k = 1...M
and
+ 1)) m(`
+ 1) (1 m(`
+ 1)).
o (` + 1) = (y(` + 1) m(`
3. Determine the Kalman gains for all of the neurons according to
1
>
Aok (` + 1)ok (`)Aok (` + 1) + 1 ,
1
>
>
Khk (` + 1) = hj (`)Ahj (` + 1) Ahj (` + 1)hj (`)Ahj (` + 1) + 1 ,
Kok (` + 1) = ok (`)Aok (` + 1)
>
k = 1...M
j = 1...L
k = 1...M
j = 1...L
6. If the overall cost function (10.13) is sufficiently small, stop, else set ` = ` + 1 and go
to 2.
This method was originally suggested by Shah and Palmieri [SP90], who called it the multiple extended Kalman algorithm (MEKA). Here is an excerpt from the object FFNKAL class
extending FFN, showing the class method which implements the Kalman filter algorithm:
Pro FFNKAL:: train
; define update matrices for Wh and Wo
dWh = fltarr(self.LL,self.NN+1)
dWo = fltarr(self.MM,self.LL+1)
iter = 0L
iter100 = 0L
progressbar = Obj_New(progressbar, Color=blue, Text=0,$
title=Training: exemplar number...,xsize=250,ysize=20)
168
;
;
;
;
;
;
;
;
;
;
;
;
;
;
progressbar->start
window,12,xsize=600,ysize=400,title=FFF(Kalman filter)
wset,12
repeat begin
if progressbar->CheckCancel() then begin
print,Training interrupted
progressbar->Destroy
return
endif
select exemplar pair at random
ell = long(self.p*randomu(seed))
x=(*self.Xs)[ell,*]
y=(*self.Ys)[ell,*]
send it through the network
m=self->forwardPass(x)
error at output
e=y-m
loop over the output neurons
for k=0,self.MM-1 do begin
linearized input (column vector)
Ao = m[k]*(1-m[k])*(*self.N)
Kalman gain
So = (*self.So)[*,*,k]
SA = So##Ao
Ko = SA/((transpose(Ao)##SA)[0]+1)
determine delta for this neuron
dWo[k,*] = Ko*e[k]
update its covariance matrix
So = So - Ko##transpose(Ao)##So
(*self.So)[*,*,k] = So
endfor
update the output weights
*self.Wo = *self.Wo + dWo
backpropagated error
beta_o =e*m*(1-m)
loop over the hidden neurons
for j=0,self.LL-1 do begin
linearized input (column vector)
Ah = X*(*self.N)[j+1]*(1-(*self.N)[j+1])
Kalman gain
Sh = (*self.Sh)[*,*,j]
SA = Sh##Ah
Kh = SA/((transpose(Ah)##SA)[0]+1)
determine delta for this neuron
dWh[j,*] = Kh*((*self.Wo)[*,j+1]##beta_o)[0]
update its covariance matrix
Sh = Sh - Kh##transpose(Ah)##Sh
(*self.Sh)[*,*,j] = Sh
endfor
update the hidden weights
169
170
Appendix D
ENVI Extensions
D.1
Installation
171
172
D.2
D.2.1
Topographic modelling
Calculating building heights
CALC HEIGHT is an ENVI extension to determine height of vertical buildings in QuickBird/Ikonos images using rational function models (RFMs) provided with ortho-ready imagery. It is invoked as
Tools/Building Height
from the ENVI display menu.
Usage
Load an RFM file in the CalcHeight window with File/Load RPC File (extension RPC
or RPB). If a DEM is available for the scene, this can also be loaded with File/Load DEM
File. A DEM is not required, however. Click on the bottom of a vertical structure to set
the base height and then shift-click on the top of the structure. Press the CALC button
to display the structures height, latitude, longitude and base elevation. The number in
brackets next to the height is the minimum distance (in pixels) between the top pixel and a
vertical line through the bottom pixel. It should be of the order of 1 or less.
If no DEM is loaded, the base elevation is the average value for the whole scene. If
a DEM is used, the base elevation is taken from it. The latitude and longitude are then
orthorectified values.
Source headers
;+
; NAME:
;
CALCHEIGHT
; PURPOSE:
;
Determine height (and lat, long, elevation) of vertical buildings
;
in QuickBird/Ikonos images using RPCs
; AUTHOR;
;
Mort Canty (2004)
;
Juelich Research Center
;
m.canty@fz-juelich.de
; CALLING SEQUENCE:
;
CalcHeight
; ARGUMENTS:
;
Event (if used as a plug-in menu item)
; KEYWORDS:
None
;
; COMMON BLOCKS:
;
Shared, RPC, Cb, Rb, Ct, Rt, elev
;
Cursor_Motion_C, dn, Cbtext, Rbtext, Cttext, Rttext
;
RPC: structure with RPC camera model
;
Cb, Rb: coordinates of building base
;
Ct, Rt: coordinates of building top
;
elev: elevation of base
;
dn: display number
173
;
Cbtext ... : Edit widgets
; DEPENDENCIES:
;
ENVI
;
CURSOR_MOTION
; -------------------------------------------------------------
;+
; NAME:
;
CURSOR_MOTION
; PURPOSE:
;
Cursor communication with ENVI image windows
; AUTHOR;
;
Mort Canty (2004)
;
Juelich Research Center
;
m.canty@fz-juelich.de
; CALLING SEQUENCE:
;
Cursor_Motion, dn, xloc, yloc, xstart=xstart, ystart=ystart, event=event
; ARGUMENTS:
;
dn: display number
;
xloc,yloc: mouse position
; KEYWORDS
;
xstart, ystart: display origin
;
event: mouse event
; COMMON BLOCKS:
;
Cursor_Motion_C, dn, Cbtext, Rbtext, Cttext, Rttext
; DEPENDENCIES:
;
None
;--------------------------------------------------------------------------
D.2.2
Illumination correction
C CORRECTION is an ENVI extension for local illumination correction for multispectral images. It is invoked from the ENVI main menu as
Topographic/Illumination Correction.
Usage
From the Choose image for correction menu select the (spectral/spatial subset of the)
image to be corrected. Then in the C-correction parameters box enter the solar elevation
and azimuth in degrees and, if desired, a new size for the kernel used for slope/aspect
determination (default 99). In the Choose digital elevation file window select the
corresponding DEM file. Finally in the Output corrected image box choose an output file
name or select memory.
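The correction itself is a simple per-band scaling. As a sketch of the underlying relation (the symbols here are illustrative; see the Riano et al. reference in the source header below): with theta_z the solar zenith angle and gamma_i the local incidence angle computed from slope, aspect and solar azimuth, each band is multiplied by the factor (cos theta_z + c)/(cos gamma_i + c), where the constant c is the intercept-to-slope ratio of a linear regression of the band values on cos gamma_i.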
Source headers
;+
; NAME:
;       C_CORRECTION
; PURPOSE:
;       ENVI extension for c-correction for solar illumination in rough terrain
;       Ref: D. Riano et al. IEEE Transactions on
;       Geoscience and Remote Sensing, 41(5) 2003, 1056-1061
; AUTHOR:
;       Mort Canty (2004)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       C_Correction
; ARGUMENTS:
;       Event (if used as a plug-in menu item)
; KEYWORDS:
;       None
; DEPENDENCIES:
;       ENVI
;------------------------------------------------------------------------
D.3 Image registration
CONTOUR_MATCH is an ENVI extension for determination of ground control points (GCPs) for
image-image registration. It is invoked from the ENVI main menu as
Map/Registration/Contour Matching.
Usage
In the Choose base image band window enter a (spatial subset) of the base image. Then
in the Choose warp image band window select the image to be warped. In the LoG sigma
box choose the size of the Laplacian of Gaussian filter kernel. The default is 25 (σ = 2.5).
Finally in the Save GCPs to ASCII menu enter a file name (extension .pts) for the GCPs.
After the calculation, these can then be loaded and inspected in the usual ENVI image-image
registration dialog.
Source headers
;+
; NAME:
;       CONTOUR_MATCH
; PURPOSE:
;       ENVI extension for extraction of ground control points for image-image registration
;       Images may be already georeferenced, in which case GCPs are for "fine adjustment"
;       Uses Laplacian of Gaussian filter and contour tracing to match closed contours
;       Ref: Li et al, IEEE Transactions on Image Processing, 4(3) (1995) 320-334
; AUTHOR:
;       Mort Canty (2004)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       Contour_Match
; ARGUMENTS:
;       Event (if used as a plug-in menu item)
; KEYWORDS:
;       None
; DEPENDENCIES:
;       ENVI
;       CI_DEFINE
;       PROGRESSBAR_DEFINE (FSC_COLOR)
;-----------------------------------------------------------------------
;+
; NAME:
;       CI__DEFINE
; PURPOSE:
;       Find thin closed contours in an image band with combined Sobel-LoG filtering
;       Ref: Li et al, IEEE Transactions on Image Processing, 4(3) (1995) 320-334
; AUTHOR:
;       Mort Canty (2004)
;       Juelich Research Center
;       m.canty@fz-juelich.de
;-----------------------------------------------------------------------
D.4 Image fusion
D.4.1 DWT fusion
ARSIS_DWT is an ENVI extension for panchromatic sharpening with the discrete wavelet
transform (DWT). It is invoked from the ENVI main menu as
Transform/Image Sharpening/Wavelet(ARSIS Model)/DWT
Usage
In the Select low resolution multi-band input file window choose the (spatial/spectral
subset of the) image to be sharpened. In the Select hi res input band window choose the
corresponding panchromatic or high resolution image. Then in the ARSIS Fusion Output
box select an output file name or memory.
Source headers
;+
; NAME:
;       ARSIS_DWT
; PURPOSE:
;       ENVI extension for panchromatic sharpening under ARSIS model
;       with Mallat's discrete wavelet transform and Daubechies wavelets
;       Ref: Ranchin and Wald, Photogramm. Eng. Remote. Sens.
;       66(1), 2000, 49-61
; AUTHOR:
;       Mort Canty (2004)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       ARSIS_DWT
; ARGUMENTS:
;       Event (if used as a plug-in menu item)
; KEYWORDS:
;       None
; DEPENDENCIES:
;       ENVI
;       DWT__DEFINE (PHASE_CORR)
;       ORTHO_REGRESS
;-----------------------------------------------------------------------
;+
; NAME:
;       DWT__DEFINE
; PURPOSE:
;       Discrete wavelet transform class using Daubechies wavelets
;       for construction of pyramid representations of images, fusion etc.
;       Ref: T. Ranchin, L. Wald, Photogrammetric Engineering and
;       Remote Sensing 66(1) (2000) 49-61.
; AUTHOR:
;       Mort Canty (2004)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       dwt = Obj_New("DWT",image)
; ARGUMENTS:
;       image: grayscale image to be compressed
; KEYWORDS:
;       None
; METHODS:
;       SET_COEFF: choose the Daubechies wavelet
;          dwt -> Set_Coeff, n
;          n = 4,6,8,12
;       SHOW_IMAGE: display the image pyramid in a window
;          dwt -> Show_Image, wn
;       INJECT: overwrite upper left quadrant
;          after phase correlation match if keyword pc is set (default)
;          dwt -> Inject, array, pc = pc
;       SET_COMPRESSIONS: set the number of compressions
;          dwt -> Set_Compressions, nc
;       GET_COMPRESSIONS: get the number of compressions
;          nc = dwt -> Get_Compressions()
;       GET_NUM_COLS: get the number of columns in the compressed image
;          cols = dwt -> Get_Num_Cols()
;       GET_NUM_ROWS: get the number of rows in the compressed image
;          rows = dwt -> Get_Num_Rows()
;       GET_IMAGE: return the pyramid image
;          im = dwt -> Get_Image()
;       GET_QUADRANT: get compressed image (as 2D array) or innermost
;          wavelet coefficients as vector
;          wc = dwt -> Get_Quadrant(n)
;          n = 0,1,2,3
;       NORMALIZE_WC: normalize wavelet coefficients at all levels
;          dwt -> Normalize, a, b
;          a, b are normalization parameters
;       COMPRESS: perform a single compression
;          dwt -> Compress
;       EXPAND: perform a single expansion
;          dwt -> Expand
; DEPENDENCIES:
;       PHASE_CORR
;---------------------------------------------------------------------
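A minimal usage sketch of the DWT class (assuming the class file is compiled; the random array stands in for a real image band):

; build a two-level pyramid and display it
band = randomu(seed, 256, 256)
dwt = Obj_New("DWT", band)
dwt -> Set_Coeff, 4                 ; Daubechies 4-coefficient wavelet
dwt -> Compress                     ; first compression
dwt -> Compress                     ; second compression
print, dwt -> Get_Compressions()    ; prints 2
window, 11, xsize=256, ysize=256
dwt -> Show_Image, 11
Obj_Destroy, dwt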
;+
; NAME:
;       PHASE_CORR
; PURPOSE:
;       Returns relative offset [xoff,yoff] of two images using phase correlation
;       Maximum offset should not exceed +- 5 pixels in each dimension
;       Returns -1 if dimensions are not equal
;       Ref: H. Shekarforoush et al. INRIA 2707
; AUTHOR:
;       Mort Canty (2004)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       shft = Phase_Corr(im1,im2,display=display,subpixel=subpixel)
; ARGUMENTS:
;       im1, im2: the images to be correlated
; KEYWORDS:
;       Display: (optional) show a surface plot of the correlation
;          in window with display number display
;       Subpixel: returns result to subpixel accuracy if set,
;          otherwise nearest integer (default)
; DEPENDENCIES:
;       None
;---------------------------------------------------------------------------
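A quick synthetic check of PHASE_CORR (a sketch, assuming the routine is compiled; the true offset of [3, -2] lies within the documented +-5 pixel range):

im1 = randomu(seed, 128, 128)
im2 = shift(im1, 3, -2)             ; cyclic shift by a known offset
print, Phase_Corr(im1, im2)         ; recovers [3, -2] up to sign convention
print, Phase_Corr(im1, im2, /subpixel)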
;+
; NAME:
;       ORTHO_REGRESS
; PURPOSE:
;       Orthogonal regression between two vectors
;       Ref: M. Canty et al. Remote Sensing of Environment 91(3,4) (2004) 441-451
; AUTHOR:
;       Mort Canty (2004)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       Ortho_Regress, X, Y, a, Xm, Ym, sigma_a, sigma_b
;       regression line is Y = Ym + a(X-Xm) = (Ym-aXm) + aX = b + aX
; ARGUMENTS:
;       input column vectors X and Y
;       returns a, Xm, Ym, sigma_a, sigma_b
; KEYWORDS:
;       None
; DEPENDENCIES:
;       None
;-------------------------------------------------------------------
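A minimal sketch of a call on noisy, linearly related column vectors (assuming the routine is compiled; slope 2 and intercept 1 are the values to be recovered):

n = 100
t = findgen(n)
X = transpose(t + randomn(seed, n))             ; column vector, noise in X
Y = transpose(2.0*t + 1.0 + randomn(seed, n))   ; column vector, noise in Y
Ortho_Regress, X, Y, a, Xm, Ym, sigma_a, sigma_b
print, 'slope:    ', a, ' +-', sigma_a
print, 'intercept:', Ym - a*Xm, ' +-', sigma_b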
D.4.2 ATWT fusion
ARSIS_ATWT is an ENVI extension for panchromatic sharpening under the ARSIS model with the à trous wavelet transform (ATWT).
Usage
In the Select low resolution multi-band input file window choose the (spatial/spectral
subset of the) image to be sharpened. In the Select hi res input band window choose the
corresponding panchromatic or high resolution image. Then in the ARSIS Fusion Output
box select an output file name or memory.
Source headers
;+
; NAME:
;       ARSIS_ATWT
; PURPOSE:
;       ENVI extension for panchromatic sharpening under ARSIS model
;       with "a trous" wavelet transform.
;       Ref: Aiazzi et al, IEEE Transactions on Geoscience and
;       Remote Sensing, 40(10) 2300-2312, 2002
; AUTHOR:
;       Mort Canty (2004)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       ARSIS_ATWT
; ARGUMENTS:
;       Event (if used as a plug-in menu item)
; KEYWORDS:
;       None
; DEPENDENCIES:
;       ENVI
;       ATWT__DEFINE (WARP_SHIFT, PHASE_CORR)
;       ORTHO_REGRESS
;-----------------------------------------------------------------------
;+
; NAME:
;       ATWT__DEFINE
; PURPOSE:
;       A trous wavelet transform class using Daubechies wavelets.
;       Used for shift invariant image fusion
;       Ref: Aiazzi et al. IEEE Transactions on Geoscience and
;       Remote Sensing 40(10) (2002) 2300-2312
; AUTHOR:
;       Mort Canty (2004)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       atwt = Obj_New("ATWT",image)
; ARGUMENTS:
;       image: grayscale image to be processed
; KEYWORDS:
;       None
; METHODS:
;       SHOW_IMAGE: display the image pyramid in a window
;          atwt -> Show_Image, wn
;       INJECT: overwrite the filtered image
;          atwt -> Inject, im
;       SET_TRANSFORMS: set the number of transformations
;          atwt -> Set_Transforms, nc
;       GET_TRANSFORMS: get the number of transformations
;          nc = atwt -> Get_Transforms()
;       GET_NUM_COLS: get the number of columns in the compressed image
;          cols = atwt -> Get_Num_Cols()
;       GET_NUM_ROWS: get the number of rows in the compressed image
;          rows = atwt -> Get_Num_Rows()
;       GET_IMAGE: return filtered image or details
;          im = atwt -> Get_Image(i)
;          i = 0 for filtered image, i > 0 for details
;       NORMALIZE_WC: normalize details at all levels
;          atwt -> Normalize, a, b
;          a, b are normalization parameters
;       COMPRESS: perform a single transformation
;          atwt -> Compress
;       EXPAND: perform a single reverse transformation
;          atwt -> Expand
; DEPENDENCIES:
;       WARP_SHIFT
;       PHASE_CORR
; ---------------------------------------------------------------------
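Analogous to the DWT class, a minimal ATWT sketch (assuming the class file is compiled):

band = randomu(seed, 256, 256)
atwt = Obj_New("ATWT", band)
atwt -> Compress                    ; one a trous transformation
print, atwt -> Get_Transforms()     ; prints 1
filtered = atwt -> Get_Image(0)     ; low-pass filtered image
detail1  = atwt -> Get_Image(1)     ; first detail plane
Obj_Destroy, atwt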
;+
; NAME:
;       WARP_SHIFT
; PURPOSE:
;       Use RST with bilinear interpolation to shift band to sub-pixel accuracy
; AUTHOR:
;       Mort Canty (2004)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       sband = Warp_Shift(band,shft)
; ARGUMENTS:
;       band: the image band to be shifted
; KEYWORDS:
;       None
; DEPENDENCIES:
;       ENVI
;---------------------------------------------------------------------------
D.4.3 Quality index
RUN_QUALITY_INDEX is an ENVI extension to determine the Wang-Bovik quality index of a
pan-sharpened image. It is invoked from the ENVI main menu as
Transform/Image Sharpening/Quality Index
Usage
From the Choose reference image menu select the multispectral image to which the sharpened image is to be compared. In the Choose pan-sharpened image menu, select the image
whose quality is to be determined.
Source headers
;+
; NAME:
;       RUN_QUALITY_INDEX
; PURPOSE:
;       ENVI extension for radiometric comparison of two multispectral images
;       Ref: Wang and Bovik, IEEE Signal Processing Letters 9(3) 2002, 81-84
; AUTHOR:
;       Mort Canty (2004)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       Run_Quality_Index
; ARGUMENTS:
;       Event (if used as a plug-in menu item)
; KEYWORDS:
;       None
; DEPENDENCIES:
;       ENVI
;       QI
;-------------------------------------------------------------------------
;+
; NAME:
;       QI
; PURPOSE:
;       Determine the Wang-Bovik quality index for a pan-sharpened image band
;       Ref: Wang and Bovik, IEEE Signal Processing Letters 9(3) 2002, 81-84
; AUTHOR:
;       Mort Canty (2004)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       index = QI(band1,band2)
; ARGUMENTS:
;       band1: reference band
;       band2: pan-sharpened band
;-------------------------------------------------------------------------
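For reference, the Wang-Bovik index computed by QI for two bands x and y is, in the notation of the cited paper,

Q = 4*s_xy*mx*my / ((s_x^2 + s_y^2)*(mx^2 + my^2)),

where mx, my are the band means, s_x^2, s_y^2 the variances and s_xy the covariance. Q factors into correlation, luminance and contrast terms, takes values in [-1, 1], and equals 1 only for identical bands. A hypothetical call on two equally sized band arrays:

index = QI(band1, band2)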
D.5 Change detection
D.5.1 Multivariate Alteration Detection
MAD_RUN is an ENVI extension for change detection with the MAD transformation. It is
invoked from the ENVI main menu as
Basic Tools/Change Detection/MAD
Usage
From the Choose first image window enter the first (spatial/spectral subset) of the two
image files. In the Choose second image window enter the second image file name. The
spatial and spectral subsets must be identical. If an input image is in BSQ format, it is
converted in place, after a warning, to BIP. In the MAD Output box choose a file name or
memory. The calculation begins and can be interrupted at any time with the Cancel button.
Before output, the spatial subset for the final MAD transformation can be changed, e.g.
extended to a full scene, if desired.
Source headers
;+
; NAME:
;       MAD_RUN
; PURPOSE:
;       ENVI extension for Multivariate Alteration Detection.
;       Ref: A. A. Nielsen et al. Remote Sensing of Environment 64 (1998), 1-19
;       Uses spectral tiling and is therefore suitable for large datasets.
;       Reads in two registered multispectral images (spectral/spatial subsets
;       must have the same dimensions, spectral subset size must be at least 2).
;       If an input image is in BSQ format, it is converted in place to BIP.
;       Writes the MAD variates to disk.
; AUTHOR:
;       Mort Canty (2004)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       Mad_Run
; ARGUMENTS:
;       Event (if used as a plug-in menu item)
; KEYWORDS:
;       None
; DEPENDENCIES:
;       ENVI
;       MAD_TILED (COVPM_DEFINE, GEN_EIGENPROBLEM)
;       PROGRESSBAR_DEFINE (FSC_COLOR)
;--------------------------------------------------------------------
;+
; NAME:
;       MAD_TILED
; PURPOSE:
;       Function for Multivariate Alteration Detection.
;       Ref: A. A. Nielsen et al. Remote Sensing of Environment 64 (1998), 1-19
;       Uses spectral tiling and is therefore suitable for large datasets.
;       Input files must be BIL or BIP format.
;       On error or if interrupted during the first iteration, returns -1, else 0
; AUTHOR:
;       Mort Canty (2004)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       result = Mad_Tiled(fid1,fid2,dims1,dims2,pos1,pos2)
; ARGUMENTS:
;       fid1, fid2: input file specifications
;       dims1, dims2: spatial subsets of the inputs
;       pos1, pos2: spectral subsets of the inputs
; KEYWORDS:
;       A, B: transformation eigenvectors (output)
;       means1, means2: weighted mean values for transformation, row-replicated
;       cp: change probability image from chi-square distribution
; DEPENDENCIES:
;       ENVI
;       COVPM_DEFINE
;       GEN_EIGENPROBLEM
;       PROGRESSBAR_DEFINE (FSC_COLOR)
;--------------------------------------------------------------------
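The cp keyword rests on the fact that, for no-change observations, the sum of squares of the standardized MAD components is approximately chi-square distributed with N degrees of freedom, N being the number of spectral bands. A per-pixel sketch (variable names illustrative; IDL's CHISQR_PDF returns the cumulative chi-square probability):

Z = total((mads/sigma_mads)^2)      ; squared standardized MAD components
cp = chisqr_pdf(Z, N)               ; change probability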
;+
; NAME:
;       COVPM__DEFINE
; PURPOSE:
;       Object class for iterative covariance matrix calculation
;       using the method of provisional means.
; AUTHOR:
;       Mort Canty (2004)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       covpm = Obj_New("COVPM")
; ARGUMENTS:
;       None
; KEYWORDS:
;       None
; METHODS:
;       UPDATE: update the covariance matrix with an observation
;          covpm -> Update, v, weight = w
;          v is an observation vector (array)
;          w is an optional weight for that observation
;       COVARIANCE: read out the covariance matrix
;          cov = covpm -> Covariance()
;       MEANS: read out the observation means
;          mns = covpm -> Means()
; DEPENDENCIES:
;       None
;-------------------------------------------------------------
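A minimal sketch of the provisional means interface (assuming the class file is compiled), accumulating 1000 random 3-component observations:

covpm = Obj_New("COVPM")
for i = 0L, 999 do covpm -> Update, randomn(seed, 3)
print, covpm -> Means()             ; approximately zero
print, covpm -> Covariance()        ; approximately the identity
Obj_Destroy, covpm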
;+
; NAME:
;       GEN_EIGENPROBLEM
; PURPOSE:
;       Solve the generalized eigenproblem
;          C##a = lambda*B##a
;       using Cholesky factorization
; AUTHOR:
;       Mort Canty (2001)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       Gen_Eigenproblem, C, B, A, lambda
; ARGUMENTS:
;       C and B are real, square, symmetric matrices
;       returns the eigenvalues in the row vector lambda
;       returns the eigenvectors a as the columns of A
; KEYWORDS:
;       None
; DEPENDENCIES:
;       None
;---------------------------------------------------------------------
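A small worked call (a sketch, assuming the routine is compiled; C and B here are arbitrary symmetric matrices with B positive definite):

C = [[2.0, 0.5], [0.5, 1.0]]
B = [[1.0, 0.2], [0.2, 1.5]]
Gen_Eigenproblem, C, B, A, lambda
print, 'eigenvalues: ', lambda
print, 'eigenvectors (columns of A):'
print, A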
D.5.2 MAF transformation
MAF is an ENVI extension for performing the MAF transformation, usually on previously
calculated MAD variates. It is invoked from the ENVI main menu as
Basic Tools/Change Detection/MAF (of MAD)
Usage
In the Choose multispectral image box select the file to be transformed. In the MAF
Output box select an output file name or memory.
Source headers
;+
; NAME:
;       MAF
; PURPOSE:
;       ENVI extension for Maximum Autocorrelation Fraction transformation.
;       Ref: Green et al, IEEE Transactions on Geoscience and Remote Sensing,
;       26(1):65-74, 1988
; AUTHOR:
;       Mort Canty (2004)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       Maf
; ARGUMENTS:
;       Event (if used as a plug-in menu item)
; KEYWORDS:
;       None
; DEPENDENCIES:
;       ENVI
;       GEN_EIGENPROBLEM
;---------------------------------------------------------------------
D.5.3 Radiometric normalization
RADCAL is an ENVI extension for radiometric normalization with the MAD transformation.
It is invoked from the ENVI main menu as
Basic Tools/Change Detection/MAD Radiometric Normalization
Usage
From the Choose reference image window enter the first (spatial/spectral subset) of the
two image files. In the Choose target image window enter the second image file name.
The spatial and spectral subsets must be identical. If an input image is in BSQ format, it
is converted in place, after a warning, to BIP. In the MAD Output box choose a file name or
memory. The calculation begins and can be interrupted at any time with the Cancel button.
In a series of plot windows the regression lines used for the normalization are plotted. The
results can then be used to calibrate another file, e.g. a full scene.
Source headers
;+
; NAME:
;       RADCAL
; PURPOSE:
;       Radiometric calibration using MAD
;       Ref: M. Canty et al. Remote Sensing of Environment 91(3,4) (2004) 441-451
;       Reference and target images must have equal spatial and spectral dimensions,
;       at least 2 spectral components, and be registered to one another.
;       Once the regression coefficients have been determined, they can be used to
;       calibrate another file, for example a full scene, which need not be registered
;       to the reference image.
; AUTHOR:
;       Mort Canty (2004)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       Radcal
; ARGUMENTS:
;       Event (if used as a plug-in menu item)
; KEYWORDS:
;       None
; DEPENDENCIES:
;       ENVI
;       ORTHO_REGRESS
;       MAD_TILED (COVPM_DEFINE, GEN_EIGENPROBLEM)
;       PROGRESSBAR_DEFINE (FSC_COLOR)
;-----------------------------------------------------------------
D.6 Unsupervised classification
D.6.1 Hierarchical clustering
Source headers
;+
; NAME:
;       HCL
; AUTHOR:
;       Mort Canty (2004)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       HCL, Xs, K, Cs
; ARGUMENTS:
;       Xs: input observations array (column vectors)
;       K: number of clusters
;       Cs: cluster labels of observations
; KEYWORDS:
;       None
; DEPENDENCIES:
;       PROGRESSBAR__DEFINE (FSC_COLOR)
;-------------------------------------------------------------------
;+
; NAME:
;       CLASS_LOOKUP_TABLE
; PURPOSE:
;       Provide 16 class colors for supervised and unsupervised classification programs
; AUTHOR:
;       Mort Canty (2004)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       colors = Class_Lookup_Table(Ptr)
; ARGUMENTS:
;       Ptr: a vector of pointers into the table
; KEYWORDS:
;       None
; DEPENDENCIES:
;       None
;---------------------------------------------------------------------
D.6.2 Fuzzy K-means clustering
SAMPLE_FKMRUN is an ENVI extension for fuzzy K-means clustering. It is invoked from the
ENVI main menu as
Classification/Unsupervised/Fuzzy-K-Means
Usage
In the Choose multispectral image window select the (spatial/spectral subset of the)
desired image. In the Number of Classes box select the desired number of clusters. In the
FKM Output box select the output file name or memory.
Source headers
;+
; NAME:
;       SAMPLE_FKMRUN
; PURPOSE:
;       ENVI extension for fuzzy K-means clustering with sampled data
; AUTHOR:
;       Mort Canty (2004)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       Sample_FKMrun
; ARGUMENTS:
;       Event (if used as a plug-in menu item)
; KEYWORDS:
;       None
; DEPENDENCIES:
;       ENVI
;       FKM (PROGRESSBAR_DEFINE (FSC_COLOR))
;       CLUSTER_FKM
;       CLASS_LOOKUP_TABLE
;--------------------------------------------------------------------
;+
; NAME:
;       FKM
; PURPOSE:
;       Fuzzy K-means clustering algorithm.
;       Takes data array Xs (data as column vectors), number of clusters K.
;       Returns fuzzy membership matrix U and the class centers Ms.
;       Ref: J. C. Dunn, Journal of Cybernetics, PAM1-1:32-57, 1973
; AUTHOR:
;       Mort Canty (2004)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       FKM, Xs, K, U, Ms, niter=niter, seed=seed
; ARGUMENTS:
;       Xs: input observations array (column vectors)
;       K: number of clusters
;       U: final class probability membership matrix (output)
;       Ms: cluster means (output)
; KEYWORDS:
;       niter: number of iterations (optional)
;       seed: initial random number seed (optional)
; DEPENDENCIES:
;       PROGRESSBAR__DEFINE (FSC_COLOR)
;-------------------------------------------------------------------
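A minimal clustering sketch on synthetic two-class data (assuming FKM is compiled; observations are the columns of a (2, n) array as documented above):

n = 200
Xs = fltarr(2, 2*n)
Xs[*, 0:n-1] = randomn(seed, 2, n) - 2.0
Xs[*, n:*]   = randomn(seed, 2, n) + 2.0
FKM, Xs, 2, U, Ms, niter=20
print, Ms                           ; centers near (-2,-2) and (2,2)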
;+
; NAME:
;       CLUSTER_FKM
; PURPOSE:
;       Modified distance clusterer from IDL library
; CALLING SEQUENCE:
;       labels = Cluster_fkm(Array,Weights,Double=Double,N_clusters=N_clusters)
;-------------------------------------------------------------------------
D.6.3 EM clustering
SAMPLE_EMRUN is an ENVI extension for EM clustering. It is invoked from the ENVI main
menu as
Classification/Unsupervised/EM(Sampled)
TILED_EMRUN can be used to cluster large data sets. It is invoked from the ENVI main menu
as
Classification/Unsupervised/EM(Tiled)
Usage
In the Choose multispectral image for clustering window select the (spatial/spectral
subset of the) desired image. In the Number of Samples box choose the size of the representative random sample (default 1000). In the Number of Classes box select the desired
number of clusters. In the FKM Output box select the output file name or memory. In
the Output class membership probs box select the output file name for the probabilities
(rule) image, or Cancel if this is not desired. The rule image will be byte coded (0 = probability 0, 255 = probability 1). In the tiled version, output to memory is not possible. During
calculation a log likelihood plot is shown. Calculation can be interrupted at any time.
Source headers
;+
; NAME:
;       SAMPLE_EMRUN
; PURPOSE:
;       ENVI extension for EM clustering with sampled data
; AUTHOR:
;       Mort Canty (2004)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       Sample_EMrun
; ARGUMENTS:
;       Event (if used as a plug-in menu item)
; KEYWORDS:
;       None
; DEPENDENCIES:
;       ENVI
;       EM (PROGRESSBAR__DEFINE (FSC_COLOR))
;       CLUSTER_EM
;       CLASS_LOOKUP_TABLE
;--------------------------------------------------------------------
;+
; NAME:
;       TILED_EMRUN
; PURPOSE:
;       ENVI extension for EM clustering on sampled data, large data sets
; AUTHOR:
;       Mort Canty (2004)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       Tiled_EMrun
; ARGUMENTS:
;       Event (if used as a plug-in menu item)
; KEYWORDS:
;       None
; DEPENDENCIES:
;       ENVI
;       EM (PROGRESSBAR__DEFINE (FSC_COLOR))
;       FKM
;       CLUSTER_EM
;       CLASS_LOOKUP_TABLE
;--------------------------------------------------------------------
;+
; NAME:
;       EM
; PURPOSE:
;       Expectation maximization clustering algorithm for Gaussian mixtures.
;       Takes data array Xs (data as column vectors) and initial
;       class membership probability matrix U as input.
;       Returns U, the class centers Ms, priors Ps and final
;       class covariances Fs.
;       Allows for simulated annealing
;       Ref: Gath and Geva, IEEE Trans. Pattern Anal. and Mach.
;       Intell. 3(3):773-781, 1989
;       Hilger, Exploratory Analysis of Multivariate Data,
;       PhD Thesis, IMM, Technical University of Denmark, 2001
; AUTHOR:
;       Mort Canty (2005)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       EM, Xs, U, Ms, Ps, Fs, unfrozen=unfrozen, wnd=wnd, $
;          maxiter=maxiter, miniter=miniter, verbose=verbose, $
;          pdens=pdens, pd_exclude=pdens_exclude, fhv=fhv, T0=T0
; ARGUMENTS:
;       Xs: input observations array (column vectors)
;       U: initial class probability membership matrix (column vectors)
;       Ms: cluster means (output)
;       Ps: cluster priors (output)
;       Fs: cluster covariance matrices (output)
; KEYWORDS:
;       unfrozen: indices of the observations which
;          take part in the iteration (default all)
;       wnd: window for displaying the log likelihood (optional)
;       maxiter: maximum iterations (optional)
;       miniter: minimum iterations (optional)
;       pdens: partition density (output, optional)
;       pd_exclude: array of classes to be excluded from pdens and fhv (optional)
;       fhv: fuzzy hypervolume (output, optional)
;       T0: initial annealing temperature (default 1.0)
;       verbose: set to print output info to IDL log
; DEPENDENCIES:
;       PROGRESSBAR__DEFINE (FSC_COLOR)
;-------------------------------------------------------------------
;+
; NAME:
;       CLUSTER_EM
; PURPOSE:
;       Cluster data after running the EM algorithm
;       Takes data array (as row vectors), means Ms (as row vectors), priors Ps
;       and covariance matrices Fs and returns the class labels.
;       Class membership probabilities are returned in class_probs
; AUTHOR:
;       Mort Canty (2004)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       labels = Cluster_EM(Xs,Ms,Ps,Fs,class_probs=class_probs,progress_bar=progress_bar)
; ARGUMENTS:
;       Xs: data array
;       Ms: cluster means
;       Ps: cluster priors
;       Fs: cluster covariance matrices
; KEYWORDS:
;       class_probs (optional): contains cluster membership probability image
;       progress_bar: set to 0 if no progressbar is desired
; DEPENDENCIES:
;       PROGRESSBAR__DEFINE (FSC_COLOR)
;--------------------------------------------------------------------
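Continuing the synthetic example from the FKM header above: the FKM memberships can initialize EM, and CLUSTER_EM then assigns hard labels. A minimal sketch; note that CLUSTER_EM documents its inputs as row vectors, hence the transposes:

FKM, Xs, 2, U, Ms, niter=10          ; initial memberships and means
EM, Xs, U, Ms, Ps, Fs, maxiter=50    ; Gaussian mixture refinement
labels = Cluster_EM(transpose(Xs), transpose(Ms), Ps, Fs, progress_bar=0)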
D.6.4 Probabilistic label relaxation
PLR is an ENVI extension for performing probabilistic relaxation on rule (class membership
probability) images generated by supervised and unsupervised classification algorithms. It
is invoked from the ENVI main menu.
Source headers
;+
; NAME:
;       PLR_RECLASS
; PURPOSE:
;       ENVI extension for postclassification with
;       Probabilistic Label Relaxation
;       Ref. Richards and Jia, Remote Sensing Digital Image Analysis (1999) Springer
;       Processes a rule image (class membership probabilities), outputs a
;       new classification file
; AUTHOR:
;       Mort Canty (2004)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       Plr_Reclass
; ARGUMENTS:
;       Event (if used as a plug-in menu item)
; KEYWORDS:
;       None
; DEPENDENCIES:
;       ENVI
;       PROGRESSBAR_DEFINE (FSC_COLOR)
;----------------------------------------------------------------------
D.6.5 Kohonen self-organizing map
SAMPLE_SOMRUN is an ENVI extension for clustering with the Kohonen self-organizing map.
It is invoked from the ENVI main menu as
Classification/Unsupervised/Kohonen SOM
Usage
In the Choose multispectral image window select the (spatial/spectral subset of the)
desired image. In the Cube side dimension box select the desired dimension of the cubic
neural network (default 6). In the SOM Output box select the output file name or memory.
Source headers
;+
; NAME:
;       SAMPLE_SOMRUN
; PURPOSE:
;       ENVI extension for Kohonen Self Organizing Map with sampled data
;       Ref. T. Kohonen, Self Organization and Associative Memory, Springer 1989.
; AUTHOR:
;       Mort Canty (2004)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       Sample_KFrun
; ARGUMENTS:
;       Event (if used as a plug-in menu item)
; KEYWORDS:
;       None
; DEPENDENCIES:
;       ENVI
;       PROGRESSBAR__DEFINE (FSC_COLOR)
;---------------------------------------------------------------------
D.6.6 MAD View
MAD_VIEW is an IDL GUI (graphical user interface) for viewing and processing MAD and
MNF/MAD change images. It is invoked from the ENVI main menu as
Basic Tools/Change Detection/MAD View
Usage
This extension is provided with an on-line help.
Source headers
;+
; NAME:
;       MAD_VIEW
; PURPOSE:
;       GUI for viewing, thresholding and clustering MAD/MNF images
;       Ref: A. A. Nielsen et al. Remote Sensing of Environment 64 (1998), 1-19
;       A. A. Nielsen private communication (2004)
; AUTHOR:
;       Mort Canty (2004)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       Mad_View
; ARGUMENTS:
;       Event (if used as a plug-in menu item)
; KEYWORDS:
;       None
; DEPENDENCIES:
;       ENVI
;       EM
;       CLUSTER_EM
;       PROGRESSBAR_DEFINE (FSC_COLOR)
;---------------------------------------------------------------------
D.7 Neural network classification (conjugate gradient)
FFNCG_RUN is an ENVI extension for supervised classification with a two-layer feed forward
neural network. It uses the scaled conjugate gradient training algorithm and can be used as
a replacement for the much slower backpropagation neural network implemented in ENVI.
It is invoked from the ENVI main menu as
Classification/Supervised/Neural Net/Conjugate Gradient
Usage
In the Enter file for classification window select the (spatial/spectral subset of the)
desired image. This must be in BIP format. In the ROI selection box choose the training
regions desired. In the Output FFN classification to file box select the output file
name. In the Output FFN probabilities to file box select the output file name for the
probabilities (rule) image, or Cancel if this is not desired. The rule image will be byte
coded (0 = probability 0, 255 = probability 1). In the Number of hidden units box select
the number of neurons in the first layer (default 4). As the calculation proceeds, the cost
function is displayed in a plot window. The calculation can be interrupted with Cancel.
Source headers
;+
; NAME:
;       FFNCG_RUN
; PURPOSE:
;       ENVI extension for classification of a multispectral image
;       with a feed forward neural network using scaled conjugate gradient training
; AUTHOR:
;       Mort Canty (2004)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       FfnCG_Run
; ARGUMENTS:
;       Event (if used as a plug-in menu item)
; KEYWORDS:
;       None
; DEPENDENCIES:
;       ENVI
;       PROGRESSBAR_DEFINE (FSC_COLOR)
;       FFNCG__DEFINE (FFN__DEFINE)
;---------------------------------------------------------------------
;+
; NAME:
;       FFNCG__DEFINE
; PURPOSE:
;       Object class for implementation of a two-layer, feed-forward
;       neural network for classification of multi-spectral images.
;       Implements scaled conjugate gradient training.
;       Ref: C. Bishop, Neural Networks for Pattern Recognition, Oxford 1995
;       M. Canty, Fernerkundung mit neuronalen Netzen, Expert 1999
; AUTHOR:
;       Mort Canty (2005)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       ffn = Obj_New("FFNCG",Xs,Ys,L)
; ARGUMENTS:
;       Xs: array of observation column vectors
;       Ys: array of class label column vectors of form (0,0,1,0,0,...0)^T
;       L: number of hidden neurons
; KEYWORDS:
;       None
; METHODS:
;       ROP: determine the matrix product v^T.H, where H is the Hessian of
;          the cost function wrt the weights, using the R-operator
;          r = ffn -> Rop(v)
;       HESSIAN: calculate the Hessian
;          h = ffn -> Hessian()
;       EIGENVALUES: calculate the eigenvalues of the Hessian
;          e = ffn -> Eigenvalues()
;       GRADIENT: calculate the gradient of the global cost function
;          g = ffn -> Gradient()
;       TRAIN: train the network
;          ffn -> train
; DEPENDENCIES:
;       FFN__DEFINE
;       PROGRESSBAR (FSC_COLOR)
;-------------------------------------------------------------
;+
; NAME:
;       FFN__DEFINE
; PURPOSE:
;       Object class for implementation of a two-layer, feed-forward
;       neural network for classification of multi-spectral images.
;       This is a generic class with no training methods.
;       Ref: M. Canty, Fernerkundung mit neuronalen Netzen, Expert 1999
; AUTHOR:
;       Mort Canty (2005)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       ffn = Obj_New("FFN",Xs,Ys,L)
; ARGUMENTS:
;       Xs: array of observation column vectors
;       Ys: array of class label column vectors of form (0,0,1,0,0,...0)^T
;       L: number of hidden neurons
; KEYWORDS:
;       None
; METHODS (external):
;       OUTPUT: return a class membership probability vector for an observation
;          row vector x
;          p = ffn -> Output(x)
;       CLASS: return the class for an observation row vector x
;          p = ffn -> Class(x)
; DEPENDENCIES:
;       None
;--------------------------------------------------------------
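A minimal training sketch for the conjugate gradient class on toy two-class data (assuming the FFN classes are compiled; labels are one-hot column vectors as documented above):

n = 100
Xs = fltarr(2, 2*n) & Ys = fltarr(2, 2*n)
Xs[*, 0:n-1] = randomn(seed, 2, n) - 1.0 & Ys[0, 0:n-1] = 1.0
Xs[*, n:*]   = randomn(seed, 2, n) + 1.0 & Ys[1, n:*]   = 1.0
ffn = Obj_New("FFNCG", Xs, Ys, 4)    ; 4 hidden neurons
ffn -> Train
print, ffn -> Class(Xs[*, 0])        ; CLASS is inherited from FFN
Obj_Destroy, ffn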
D.8 Neural network classification (Kalman filter)
FFNKAL_RUN is an ENVI extension for supervised classification with a two-layer feed forward
neural network. It uses a fast Kalman Filter training algorithm and can be used as a
replacement for the much slower backpropagation neural network implemented in ENVI. It
is invoked from the ENVI main menu as
Classification/Supervised/Neural Net/Kalman Filter
Usage
In the Enter file for classification window select the (spatial/spectral subset of the)
desired image. This must be in BIP format. In the ROI selection box choose the training
regions desired. In the Output FFN classification to file box select the output file
name. In the Output FFN probabilities to file box select the output file name for the
probabilities (rule) image, or Cancel if this is not desired. The rule image will be byte coded
(0 = probability 0, 255 = probability 1). In the Number of hidden units box select the
number of neurons in the first layer (default 4). As the calculation proceeds, the logarithm
of the cost function is displayed in a plot window. The calculation can be interrupted with
Cancel.
Source headers
;+
; NAME:
;       FFNKAL_RUN
; PURPOSE:
;       Classification of a multispectral image with feed forward neural network
;       using Kalman filter training
; AUTHOR:
;       Mort Canty (2004)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       FfnKal_Run
; ARGUMENTS:
;       Event (if used as a plug-in menu item)
; KEYWORDS:
;       None
; DEPENDENCIES:
;       ENVI
;       PROGRESSBAR_DEFINE (FSC_COLOR)
;       FFNKAL__DEFINE (FFN__DEFINE)
;---------------------------------------------------------------------
;+
; NAME:
;       FFNKAL__DEFINE
; PURPOSE:
;       Object class for implementation of a two-layer, feed-forward
;       neural network for classification of multi-spectral images.
;       Implements Kalman filter training.
;       Ref: M. Canty, Fernerkundung mit neuronalen Netzen, Expert 1999
; AUTHOR:
;       Mort Canty (2005)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       ffnkal = Obj_New("FFNKAL",Xs,Ys,L)
; ARGUMENTS:
;       Xs: array of observation column vectors
;       Ys: array of class label column vectors of form (0,0,1,0,0,...0)^T
;       L: number of hidden neurons
; KEYWORDS:
;       None
; METHODS:
;       OUTPUT (inherited): return a class membership probability vector for an
;          observation row vector x
;          p = ffnkal -> Output(x)
;       CLASS (inherited): return the class for an observation row vector x
;          p = ffnkal -> Class(x)
;       TRAIN: train the network
;          ffnkal -> train
; DEPENDENCIES:
;       FFN__DEFINE
;       PROGRESSBAR (FSC_COLOR)
;--------------------------------------------------------------
D.9 Neural network classification (hybrid)
FFN_RUN is an ENVI extension for supervised classification with a two-layer feed forward
neural network. It uses both the Kalman filter and the scaled conjugate gradient training
algorithm and can be used as a replacement for the much slower backpropagation neural
network implemented in ENVI. It is invoked from the ENVI main menu as
Classification/Supervised/Neural Net/Hybrid
202
Usage
In the Enter file for classification window select the (spatial/spectral subset of the)
desired image. This must be in BIP format. In the ROI selection box choose the training
regions desired. In the Output FFN classification to file box select the output file
name. In the Output FFN probabilities to file box select the output file name for the
probabilities (rule) image, or Cancel if this is not desired. The rule image will be byte coded
(0 = probability 0, 255 = probability 1). In the Number of hidden units box select the
number of neurons in the first layer (default 4). As the Kalman filter calculation proceeds,
the log of the cost function is displayed in a plot window. The calculation can be interrupted
with Cancel. Then calculation continues where the Kalman filter left off with the scaled
conjugate gradient training method. The calculation can again be interrupted with Cancel.
Source headers
;+
; NAME:
;       FFN_RUN
; PURPOSE:
;       ENVI extension for classification of a multispectral image
;       with a feed forward neural network using Kalman filter
;       plus scaled conjugate gradient training
; AUTHOR:
;       Mort Canty (2004)
;       Juelich Research Center
;       m.canty@fz-juelich.de
; CALLING SEQUENCE:
;       Ffn_Run
; ARGUMENTS:
;       Event (if used as a plug-in menu item)
; KEYWORDS:
;       None
; DEPENDENCIES:
;       ENVI
;       PROGRESSBAR_DEFINE (FSC_COLOR)
;       FFNKAL__DEFINE (FFN__DEFINE)
;       FFNCG__DEFINE
;----------------------------------------------------------------------
Bibliography
[AABG02] B. Aiazzi, L. Alparone, S. Baronti, and A. Garzelli. Context-driven fusion of high spatial and spectral resolution images based on oversampled multiresolution analysis. IEEE Transactions on Geoscience and Remote Sensing, 40(10):2300-2312, 2002.
[And84] T. W. Anderson. An Introduction to Multivariate Statistical Analysis, 2nd Edition. Wiley Series in Probability and Mathematical Statistics, 1984.
[AS99]
[BFH75]
[Bil89]
[Bis95] C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[BP00]
[CNS04] M. J. Canty, A. A. Nielsen, and M. Schmidt. Automatic radiometric normalization of multitemporal satellite imagery. Remote Sensing of Environment, 91(3,4):441-451, 2004.
[DH73]
[Dun73] J. C. Dunn. A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters. Journal of Cybernetics, PAM1-1:32-57, 1973.
[Fra96]
[GG89] I. Gath and A. B. Geva. Unsupervised optimal fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 3(3):773-781, 1989.
[GW02]
[Hab95]
[Hil01] K. B. Hilger. Exploratory Analysis of Multivariate Data. PhD Thesis, IMM-PHD-2001-89, Technical University of Denmark, 2001.
[HKP91]
[Hu62]
[JRR99] J. A. Richards and X. Jia. Remote Sensing Digital Image Analysis. Springer, 1999.
[Koh89] T. Kohonen. Self-Organization and Associative Memory. Springer, 1989.
[KS79]
[LMM95] H. Li, B. S. Manjunath, and S. K. Mitra. A contour-based approach to multisensor image registration. IEEE Transactions on Image Processing, 4(3):320-334, 1995.
[Mal89] S. G. Mallat. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):674-693, 1989.
[Mil99]
[Moe93]
[NCS98] A. A. Nielsen, K. Conradsen, and J. J. Simpson. Multivariate alteration detection (MAD) and MAF processing in multispectral, bitemporal image data: New approaches to change detection studies. Remote Sensing of Environment, 64:1-19, 1998.
[Pal98] G. Palubinskas. K-means clustering algorithm using the entropy. SPIE (European Symposium on Remote Sensing, Conference on Image and Signal Processing for Remote Sensing), September, Barcelona, Vol 3500:63-71, 1998.
[Pat77]
[RCSA03] D. Riano, E. Chuvieco, J. Salas, and I. Aguado. Assessment of different topographic corrections in Landsat-TM data for mapping vegetation types. IEEE Transactions on Geoscience and Remote Sensing, 41(5):1056-1061, 2003.
[Rip96]
[RW00] T. Ranchin and L. Wald. Fusion of high spatial and spectral resolution images: the ARSIS concept and its implementation. Photogrammetric Engineering and Remote Sensing, 66(1):49-61, 2000.
[Sie65]
[Sin89]
[SP90] S. Shah and F. Palmieri. Meka - a fast, local algorithm for training feed forward neural networks. Proceedings of the International Joint Conference on Neural Networks, San Diego, I(3):41-46, 1990.
[TGG82]
[TH01] C. V. Tao and Y. Hu. A comprehensive study of the rational function model for photogrammetric processing. Photogrammetric Engineering and Remote Sensing, 67(12):1347-1357, 2001.
[WB02] Z. Wang and A. C. Bovik. A universal image quality index. IEEE Signal Processing Letters, 9(3):81-84, 2002.
[Wie97] R. Wiemker. An iterative spectral-spatial Bayesian labelling approach for unsupervised robust change detection on remotely sensed multispectral imagery. Proceedings of the 7th International Conference on Computer Analysis of Images and Patterns, Springer LNCS Vol 1296:263-270, 1997.
[WK91]