Image Analysis and Pattern Recognition For Remote Sensing With Algorithms in ENVI/IDL

Image Analysis and Pattern Recognition
for Remote Sensing

with Algorithms in ENVI/IDL
Morton John Canty

ulich GmbH
Forschungszentrum J
m.canty@fz-juelich.de
March 21, 2005
Contents
1 Images, Arrays and Vectors
1.1 Multispectral satellite images .
1.2 Algebra of vectors and matrices
1.3 Eigenvalues and eigenvectors .
1.4 Finding minima and maxima .
2 Image Statistics
2.1 Random variables . . . .
2.2 The normal distribution
2.3 A special function . . .
2.4 Conditional probabilities
Theorem . . . . . . . . .
2.5 Linear regression . . . .
.
.
.
.
.
.
.
.
.
.
.
.
. . . . . . .
. . . . . . .
. . . . . . .
and Bayes
. . . . . . .
. . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
1
1
4
6
8
13
. . . . . . . . . . . . . . . . . . . . . . . 13
. . . . . . . . . . . . . . . . . . . . . . . 14
. . . . . . . . . . . . . . . . . . . . . . . 16
. . . . . . . . . . . . . . . . . . . . . . . 17
. . . . . . . . . . . . . . . . . . . . . . . 18
3 Transformations
3.1 Fourier transforms . . . . . . . . . . . . . . .
3.1.1 Discrete Fourier transform . . . . . . .
3.1.2 Discrete Fourier transform of an image
3.2 Wavelets . . . . . . . . . . . . . . . . . . . . .
3.3 Principal components . . . . . . . . . . . . .
3.4 Minimum noise fraction . . . . . . . . . . . .
3.5 Maximum autocorrelation factor (MAF) . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
21
21
22
23
23
24
25
28
4 Radiometric enhancement
4.1 Lookup tables . . . . . . . . . . . .
4.1.1 Histogram equalization . .
4.1.2 Histogram matching . . . .
4.2 Convolutions . . . . . . . . . . . .
4.2.1 Laplacian of Gaussian filter
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
31
31
32
32
33
34
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
5 Topographic modelling
39
5.1 RST transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
i
ii
CONTENTS
5.2
5.3
5.4
5.5
5.6
Imaging transformations . . . . . . . . . .
Camera models and RFM approximations
Stereo imaging, elevation models and
orthorectification . . . . . . . . . . . . . .
Slope and aspect . . . . . . . . . . . . . .
Illumination correction . . . . . . . . . . .
6 Image Registration
6.1 Frequency domain registration
6.2 Feature matching . . . . . . . .
6.2.1 Contour detection . . .
6.2.2 Closed contours . . . . .
6.2.3 Chain codes . . . . . . .
6.2.4 Invariant moments . . .
6.2.5 Contour matching . . .
6.2.6 Consistency check . . .
6.3 Re-sampling and warping . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
7 Image Sharpening
7.1 HSV fusion . . . . . . . . . . . . .
7.2 Brovey fusion . . . . . . . . . . . .
7.3 PCA fusion . . . . . . . . . . . . .
7.4 Wavelet fusion . . . . . . . . . . .
7.4.1 Discrete wavelet transform
` trous filtering . . . . . .
7.4.2 A
7.5 Quality indices . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
. . . . . . . . . . . . . . . . . . . . 40
. . . . . . . . . . . . . . . . . . . . 41
. . . . . . . . . . . . . . . . . . . . 44
. . . . . . . . . . . . . . . . . . . . 50
. . . . . . . . . . . . . . . . . . . . 51
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
8 Change Detection
8.1 Algebraic methods . . . . . . . . . . . . . . . . . . . . . . . .
8.2 Principal components . . . . . . . . . . . . . . . . . . . . . .
8.3 Post-classification comparison . . . . . . . . . . . . . . . . . .
8.4 Multivariate alteration detection . . . . . . . . . . . . . . . .
8.4.1 Canonical correlation analysis . . . . . . . . . . . . . .
8.4.2 Solution by Cholesky factorization . . . . . . . . . . .
8.4.3 Properties of the MAD components . . . . . . . . . .
8.4.4 Covariance of MAD variates with original observations
8.4.5 Scale invariance . . . . . . . . . . . . . . . . . . . . . .
8.4.6 Improving signal to noise . . . . . . . . . . . . . . . .
8.4.7 Decision thresholds . . . . . . . . . . . . . . . . . . . .
8.5 Radiometric normalization . . . . . . . . . . . . . . . . . . . .
9 Unsupervised Classification
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
53
53
55
56
56
56
56
57
57
58
.
.
.
.
.
.
.
61
61
63
63
64
64
65
66
.
.
.
.
.
.
.
.
.
.
.
.
69
69
70
70
71
71
72
73
74
74
75
75
77
79
CONTENTS
iii
9.1
A simple cost function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
9.2
Algorithms that minimize the simple cost function . . . . . . . . . . . . . . . 81
9.3
9.2.1
K-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
9.2.2
Extended K-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
9.2.3
Agglomerative hierarchical clustering . . . . . . . . . . . . . . . . . . . 83
9.2.4
Fuzzy K-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
EM Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
9.3.1
Simulated annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
9.3.2
Partition density . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
9.3.3
Including spatial information . . . . . . . . . . . . . . . . . . . . . . . 87
9.4
The Kohonen Self Organizing Map . . . . . . . . . . . . . . . . . . . . . . . . 89
9.5
Unsupervised classification of changes . . . . . . . . . . . . . . . . . . . . . . 91
10 Supervised Classification
93
10.1 Bayes decision rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

10.2 Training data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
10.3 Bayes Maximum likelihood classification . . . . . . . . . . . . . . . . . . . . . 95
10.4 Non-parametric methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
10.5 Neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
10.5.1 The feed-forward network . . . . . . . . . . . . . . . . . . . . . . . . . 99
10.5.2 Cost functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
10.5.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
10.5.4 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
10.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
10.6.1 Standard deviation of misclassification . . . . . . . . . . . . . . . . . . 111
10.6.2 Model comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
10.6.3 Confusion matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
11 Hyperspectral analysis
117
11.1 Mixture modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

11.1.1 Full linear unmixing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
11.1.2 Unconstrained linear unmixing . . . . . . . . . . . . . . . . . . . . . . 119
11.1.3 Intrinsic end-members and pixel purity . . . . . . . . . . . . . . . . . . 119
11.2 Orthogonal subspace projection . . . . . . . . . . . . . . . . . . . . . . . . . . 121
A Least Squares Procedures
A.1 Generalized least squares
125
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
A.2 Recursive least squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

A.3 Orthogonal regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
B The Discrete Wavelet Transformation
131
iv
CONTENTS
B.1
B.2
B.3
B.4
B.5
B.6
B.7
Inner product space . . . . . .

Haar wavelets . . . . . . . . . .
Multi-resolution analysis . . . .
Fixpoint wavelet approximation
The mother wavelet . . . . . .
The Daubechies wavelet . . . .
Wavelets and filter banks . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
C Advanced Neural Network Training Algorithms

C.1 The Hessian matrix . . . . . . . . . . . . . . . . .
C.1.1 The R-operator . . . . . . . . . . . . . . .
C.1.2 Calculating the Hessian . . . . . . . . . .
C.2 Scaled conjugate gradient training . . . . . . . .
C.2.1 Conjugate directions . . . . . . . . . . . .
C.2.2 Minimizing a quadratic function . . . . .
C.2.3 The algorithm . . . . . . . . . . . . . . .
C.3 Kalman filter training . . . . . . . . . . . . . . .
C.3.1 Linearization . . . . . . . . . . . . . . . .
C.3.2 The algorithm . . . . . . . . . . . . . . .
D ENVI Extensions
D.1 Installation . . . . . . . . . . . . . . . .
D.2 Topographic modelling . . . . . . . . . .
D.2.1 Calculating building heights . . .
D.2.2 Illumination correction . . . . . .
D.3 Image registration . . . . . . . . . . . .
D.4 Image fusion . . . . . . . . . . . . . . .
D.4.1 DWT fusion . . . . . . . . . . . .
D.4.2 ATWT fusion . . . . . . . . . . .
D.4.3 Quality index . . . . . . . . . . .
D.5 Change detection . . . . . . . . . . . . .
D.5.1 Multivariate Alteration Detecton
D.5.2 Maximum autocorrelation factor
D.5.3 Radiometric normalization . . .
D.6 Unsupervised classification . . . . . . . .
D.6.1 Hierarchical clustering . . . . . .
D.6.2 Fuzzy K-means clustering . . . .
D.6.3 EM clustering . . . . . . . . . . .
D.6.4 Probabilistic label relaxation . .
D.6.5 Kohonen self organizing map . .
D.6.6 A GUI for change clustering . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
131
131
137
138
140
141
143
.
.
.
.
.
.
.
.
.
.
151
. 151
. 152
. 155
. 156
. 156
. 157
. 160
. 163
. 164
. 165
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
171
. 171
. 172
. 172
. 173
. 175
. 177
. 177
. 179
. 182
. 184
. 184
. 186
. 187
. 189
. 189
. 190
. 192
. 194
. 196
. 197
CONTENTS
D.7 Neural network: Scaled conjugate gradient . . . . . . . . . . . . . . . . . . . . 198

D.8 Neural network: Kalman filter . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
D.9 Neural network: Hybrid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
Bibliography
203
vi
CONTENTS
Chapter 1
Images, Arrays and Vectors

1.1
Multispectral satellite images
There are a number of multispectral satellite-based sensors currently in orbit which are used
for earth observation. Representative of these we mention here the Landsat ETM+ system.
The ETM+ instrument on the Landsat 7 spacecraft contains sensors to measure radiance
in three spectral intervals:
visible and near infrared (VNIR) bands - bands 1,2,3,4, and 8 (PAN) with a spectral
range between 0.4 and 1.0 micrometer.
short wavelength infrared (SWIR) bands - bands 5 and 7 with a spectral range between
1.0 and 3.0 micrometer.
thermal long wavelength infrared (LWIR) band - band 6 with a spectral range between
8.0 and 12.0 micrometer.
In addition a panchromatic (PAN) image (band 8) covering the visible spectrum is provided.
Ground resolutions are 15m (PAN), 30m (VNIR,SWIR) and 60m (LWIR). Figure 1.1 shows
a color composite image of a Landsat 7 scene over Morocco acquired in 1999.
A single multispectral image can be represented as an array of gray-scale values or digital
numbers
gk (i, j), 1 i c, 1 j r,
where c is the number of pixel columns and r is the number of pixel rows. If we are dealing
with an N -band multispectral image, then the index k, 1 k N , denotes the spectral
band. Often a pixel intensity is stored in a single byte, so that 0 gk 255.
The gray-scale values are the result of sampling along an array of sensors the at-sensor
radiance f (x, y) at wavelength due to sunlight reflected from some point (x, y) on the
Earths surface and focussed by the satellites optical system at the sensors. Ignoring atmospheric effects this radiance is given roughly by
f (x, y) i (x, y)r (x, y),
where i (x, y) is the suns irradiance at the surface in units of watt/m2 m, and r (x, y)
is the surface reflectance, a number between 0 and 1. The conversion between gray-scale
1
CHAPTER 1. IMAGES, ARRAYS AND VECTORS
Figure 1.1: Color composite of bands 4 (red), 5 (green) and 7 (blue) for a Landsat ETM+
image over Morocco.
1.1. MULTISPECTRAL SATELLITE IMAGES
or digital number g and at-sensor radiance f is determined by the sensor calibration as

measured (and maintained) by the satellite image provider:
f = Cg(i, j) + fmin
where C = (fmax fmin )/255, in which fmax and fmin are maximum and minimum measurable radiances at the sensor.
Atmospheric scattering and absorption models are used to calculate surface reflectance
from the observed at-sensor radiance, as it is the reflectance which is directly related to the
physical properties of the surface being examined.
Various conventions can be used for storing the image array g(i, j) in computer memory
or on storage media. In band interleaved by pixel (BIP) format, for example, a two-channel,
3 3 pixel image would be stored as
g1 (1, 1)
g1 (1, 2)
g1 (1, 3)
g2 (1, 1)
g2 (1, 2)
g2 (1, 3)
g1 (2, 1)
g1 (2, 2)
g1 (2, 3)
g2 (2, 1)
g2 (2, 2)
g2 (2, 3)
g1 (3, 1)
g1 (3, 2)
g1 (3, 3)
g2 (3, 1)
g2 (3, 2)
g2 (3, 3),
whereas in band interleaved by line (BIL) it would be stored as

g1 (1, 1)
g1 (1, 2)
g1 (1, 3)
g1 (2, 1)
g1 (2, 2)
g2 (2, 3)
g1 (3, 1)
g1 (3, 2)
g1 (3, 3)
g2 (1, 1)
g2 (2, 1)
g2 (3, 1)
g2 (2, 1)
g1 (2, 2)
g1 (2, 3)
g2 (3, 1)
g2 (2, 3)
g2 (3, 3),
and in band sequential (BSQ) format it is stored as

g1 (1, 1)
g1 (1, 2)
g1 (1, 3)
g2 (1, 1)
g2 (1, 2)
g2 (1, 3)
g1 (2, 1)
g1 (2, 2)
g1 (2, 3)
g2 (2, 1)
g2 (2, 2)
g2 (2, 3)
g1 (3, 1)
g1 (3, 2)
g1 (3, 3)
g2 (3, 1)
g2 (3, 2)
g2 (3, 3).
In the computer language IDL, so-called row major indexing is used for arrays and the
elements in an array are numbered from zero. This means that, if a gray-scale image g is
stored in an IDL array variable G, then the intensity value g(i, j) is addressed as G[i-1,j-1].
An N -band multispectral image is stored in BIP format as an N c r array in IDL, in
BIL format as a c N r and in BSQ format as an c r N array.
Auxiliary information, such as image acquisition parameters and georeferencing, is normally included with the image data on the same file, and the format may or may not make
use of compression algorithms. Examples are the geoTIFF1 file format used for example by
Space Imaging Inc. for distributing Carterra(c) imagery and which includes lossless compression, the HDF (Hierachical Data Format) in which for example ASTER images are distributed
and the cross-platform PCDSK format employed by PCI Geomatics with its image processing software, which is in plain ASCII code and not compressed. ENVI uses a simple flat
binary file structure with an additional ASCII header file.
1 geoTIFF refers to TIFF files which have geographic (or cartographic) data embedded as tags within the
TIFF file. The geographic data can then be used to position the image in the correct location and geometry
on the screen of a geographic information display.
1.2
Algebra of vectors and matrices
It is very convenient to use a vector representation for multispectral images, namely
g1 (i, j)
..
g(i, j) =
(1.1)
,
.
gN (i, j)
which is a column vector of multispectral gray-scale values at the position (i, j).
Since we will be making extensive use of the vector notation of Eq. (1.1) we review
here some of the basic properties of vectors and matrices. We can illustrate most of these
properties in just two dimensions.
x2
x1
Figure 1.2: A vector in two dimensions.
The transpose of the two-dimensional column vector shown in Fig. 1.2,

x1
x=
,
x2
is the row vector
x> = (x1 , x2 ).
The sum of two vectors is given by

x1
y1
x1 + y1
x+y =
+
=
,
x2
y2
x2 + y2
and the inner product by
x> y = (x1 , x2 )
y1
y2

= x1 y1 + x2 y2 .
The length or norm of the vector x is

x = |x| =
q
x21 + x22 =
x> x .
The programming language IDL is especially good at manipulating vectors and matrices:
1.2. ALGEBRA OF VECTORS AND MATRICES
IDL> x=[[1],[2]]
IDL> print,x
1
2
IDL> print,transpose(x)
1
2
>
x cos
Figure 1.3: The inner product.
The inner product can be written in terms of the vector lengths and the angle between
the two vectors as
x> y = |x||y| cos = xy cos ,
see Fig. 1.3. If = 90o the vectors are orthogonal so that
x> y = 0.
Any vector can be decomposed into orthogonal unit vectors:

x1
1
0
x=
= x1
+ x2
.
0
1
x2
A two-by-two matrix is written

A=
a11
a21
a12
a22

.
When a matrix is multiplied with a vector the result is another vector, e.g.

a11 a12
x1
a11 x1 + a12 x2
Ax =
=
.
a21 a22
x2
a21 x1 + a22 x2
The IDL operator for matrix and vector multiplication is ##.
IDL> a=[[1,2],[3,4]]
IDL> print,a
1
2
3
4
IDL> print,a##x
5
11
Matrices also have a transposed form, obtained by interchanging their rows and columns:

a11 a21
>
A =
.
a12 a22
The product of two matrices is given by

a11 a12
b11
AB =
a21 a22
b21
b12
b22

=
a11 b11 + a12 b21
and is another matrix. The determinant of a two-dimensional matrix is

|A| = det A = a11 a22 a12 a21 .
The outer product of two vectors is a matrix:

x1
x1 0
y1
>
xy =
(y1 , y2 ) =
x2
x2 0
0
y2
0

=
x1 y 1
x2 y 1
x1 y2
x2 y2
The identity matrix is given by

I=
1
0
0
1

,
IA = AI = A.
The matrix inverse A1 is defined in terms of the identity matrix according to

A1 A = AA1 = I.
In two dimensions it is easy to verify that
A1 =
1
|A|
a22
a21
a12
a11

.
IDL> print, determ(float(a))

-2.00000
IDL> print, invert(a)
-2.00000
1.00000
1.50000
-0.500000
IDL> print, a##invert(a)
1.00000
0.000000
0.000000
1.00000
If |A| = 0, then A has no inverse and is said to be a singular matrix. The trace of a
square matrix is the sum of its diagonal elements:
Tr A = a11 + a22 .
1.3
Eigenvalues and eigenvectors
The statistical properties of ensembles of pixel intensities (for example entire images or
specific land-cover classes) are often approximated by their mean values and covariance
1.3. EIGENVALUES AND EIGENVECTORS
matrices. As we will see later, covariance matrices are always symmetric. A matrix A is
symmetric if it doesnt change when it is transposed, i.e. if
A = A> .
Very often we have to solve the so-called eigenvalue problem, which is to find eigenvectors x
and eigenvalues that satisfy the equation
Ax = x
or, equivalently,
a11
a21
a12
a22

x1
x2

=
x1
x2

.
This is the same as the two equations

(a11 )x1 + a12 x2 = 0
(1.2)
a21 x1 + (a22 )x2 = 0.

If we eliminate x1 and make use of the symmetry a12 = a21 , we obtain
[(a11 )(a22 ) a212 ]x2 = 0.
In general x2 6= 0, so we must have
(a11 )(a22 ) a212 = 0,
which is known as the characteristic equation for the eigenvalue problem. It is a quadratic
equation in with solutions

q
1
(1) =
a11 + a22 + (a11 + a22 )2 4(a11 a22 a212 )
2

(1.3)
q
1
(2) =
a11 + a22 (a11 + a22 )2 4(a11 a22 a212 ) .
2
Thus there are two eigenvalues and, correspondingly, two eigenvectors x(1) and x(2) , which
can be obtained by substituting (1) and (2) into (1.2) and solving for x1 and x2 . It is easy
to show that the eigenvalues are orthogonal
(x(1) )> x(2) = 0.
The matrix formed by the two eigenvectors,

u = (x
(1)
(2)
,x
)=
is said to diagonalize the matrix a. That is

(1)
u> Au =
0
(1)
x1
(1)
x2
0
(2)
(2)
x1
(2)
x2

,
We can illustrate the whole procedure in IDL as follows:
(1.4)
IDL> a=float([[1,2],[2,3]])
IDL> print,a
1.00000
2.00000
2.00000
3.00000
IDL> print,eigenql(a,eigenvectors=u,/double)
4.2360680
-0.23606798
IDL> print,transpose(u)##a##u
4.2360680 -2.2204460e-016
-1.6653345e-016
-0.23606798
Note that, after diagonalization, the off-diagonal elements are not precisely zero due to
rounding errors in the computation.
All of the above properties generalize easily to N dimensions.
1.4
Finding minima and maxima
In order to maximize some desirable property of a multispectral image, such as signal to

noise or spread in intensity, we often need to take derivatives of vectors. A vector (partial)
and is defined as the vector

derivative in two dimensions is written x

1
0
=
+
.
0 x1
1 x2
x
Many of the operations with vector derivatives correspond exactly to operations with ordinary scalar derivatives (They can all be verified easily by writing out the expressions
component-by component):
The scalar expression
>
(x y) = y
x
analogous to
xy = y
x
>
(x x) = 2x
x
analogous to
2
x = 2x
x
x> Ay,
where A is a matrix, is called a quadratic form. We have

>
(x Ay) = Ay
x
>
(x Ay) = A> x
y
and
>
(x Ax) = Ax + A> x.
x
Note that, if A is a symmetrix matrix, this last equation can be written
>
(x Ax) = 2Ax.
x
Suppose x is a critical point of the function f (x), i.e.

d
d

f (x ) = f (x)
= 0,
dx
d
x=x
(1.5)
1.4. FINDING MINIMA AND MAXIMA
f (x)
x
d
dx f (x )
=0
x
Figure 1.4: A function of one variable.
see Fig. 1.4. Then f (x ) is a local minimum if

express f (x) as a Taylor series about x
f (x) = f (x ) + (x x )
d2
dx2 f (x )
> 0. This becomes obvious if we
d
d2
f (x ) + (x x )2 2 f (x ) + . . . .
dx
dx
For |x x | sufficiently small this is equivalent to

f (x) f (x ) + (x x )2
d2
f (x ).
dx2
The situation is similar for scalar functions of a vector:

f (x) f (x ) + (x x )>
f (x ) 1
+ (x x )> H(x x ).
x
2
(1.6)
where H is called the Hessian matrix:

(H)ij =
2
f (x ).
xi xj
In the neighborhood of the critical point, since
f (x )
x
(1.7)
= 0, we get the approximation
f (x) f (x ) + (x x )> H(x x ).

Now the condition for a local minimum is that the Hessian matrix be positive definite at the
point x . Positive definiteness means that
x> Hx > 0
for all x 6= 0.
(1.8)
Suppose we want to find a minimum (or maximum) of a scalar function f (x) of the
vector x. If there are no constraints, then we solve the set of equations
f (x)
= 0,
xi
i = 1, 2,
or, in terms of our notation for vector derivatives,

f (x)
0
=0=
.
0
x
10
However suppose that x is constrained by the equation

g(x) = 0.
For example, we might have
g(x) = x21 + x22 1 = 0
which constrains x to lie on a circle of radius 1.
Finding an minimum of f subject to g = 0 is equivalent to finding an unconstrained
minimum of
f (x) + g(x),
(1.9)
where is called a Lagrange multiplier and is treated like an additional variable, see [Mil99].
That is, we solve the set of equations
(f (x) + g(x)) = 0,
xi
(f (x) + g(x)) = 0.
i = 1, 2
(1.10)
The latter equation is just g(x) = 0.

For example, let f (x) = ax21 + bx22 and g(x) = x1 + x2 1. Then we get the three
equations
(f (x) + g(x)) = 2ax1 + = 0

x1
(f (x) + g(x)) = 2bx2 + = 0

x2
(f (x) + g(x)) = x1 + x2 1 = 0
The solution is
b
a
, x2 =
.
x1 =
a+b
a+b
1.4. FINDING MINIMA AND MAXIMA
11
Exercises
1. Show that the outer product of two 2-dimensional vectors is a singular matrix.
2. Prove that the eigenvectors or a 2 2 symmetric matrix are orthogonal.
3. Differentiate the function
1
(x a y)
with respect to y.
4. Verify the following matrix identity in IDL:
(A B)> = B> A> .
5. Calculate the eigenvalues and eigenvectors of a non-symmetric matrix with IDL.
6. Plot the function f (x) = x21 x22 with IDL. Find its minima and maxima subject to
the constraint g(x) = x21 + x22 1 = 0.
12
Chapter 2
Image Statistics
It is useful to think of image pixel intensities g(x) as realizations of a random vector G(x)
drawn independently from some probability distribution.
2.1
Random variables
A random variable can be used to represent some quantity which changes in an unpredictable
way each time it is observed. If there is a discrete set of M possible events {Ei }, i = 1 . . . M ,
associated with some random process, let pi be the probability that the ith event Ei will
occur. If ni represents the number of times Ei occurs in n trials, we expect that pi ni /n
in the limit n and that
M
X
pi = 1.
i=1
For example, on the throw of a pair of dice,

{Ei } = (1, 1), (1, 2), (2, 1) . . . (6, 6)
and each event is equally probable
pi = 1/36,
i = 1 . . . 36.
Formally, a random variable X is a real function on the set of possible events:

X = f (Ei ).
If, for example, X is the sum of the points on the dice,
X = f (E1 ) = 2, X = f (E2 ) = 3, X = f (E3 ) = 3, . . . X = f (E36 ) = 12.
On the basis of the probabilities of the individual events, we can associate a distribution
function P (x) with the random variable X, defined by
P (x) = Pr(X x).
For the dice example,
P (1) = 0, P (2) = 1/36, P (3) = 1/12, . . . P (12) = 1.
13
14
CHAPTER 2. IMAGE STATISTICS
For continuous random variables, such as the measured radiance at a satellite sensor, the
distribution function is not expressed in terms of discrete probabilities, but rather in terms
of a probability density function p(x), where p(x)dx is the probability that the value of the
random variable X lies in the interval [x, x + dx]. Then
Z x
P (x) = Pr(X x) =
p(t)dt
and, of course,
P () = 1.
P () = 0,
Two random variables X and Y are said to be independent when

Pr(X x and Y y) = Pr(X x, Y y) = P (x)P (y).
The mean or expected value of a random variable X is written hXi and is defined in
terms of the probability density function:
Z
hXi =
xp(x)dx.
The variance of X, written var(X) is defined as the expected value of the random variable
(X hXi)2 , i.e.

var(X) = (X hXi)2 .
In terms of the probability density function, it is given by
Z
var(X) =
(x hXi)2 p(x)dx.
Two simple but very useful identities follow from the definition of variance:
var(X) = hX 2 i hXi2
var(aX) = a2 var(X).
2.2
(2.1)
The normal distribution
It is often the case that random variables are well-described by the normal or Gaussian
probability density function
1
1
exp( 2 (x )2 ).
2
2
p(x) =
In that case
hXi = ,
var(X) = 2 .
The expected value of pixel intensities

hG (x)i
1
hG2 (x)i
hG(x)i =
..
.
hGN (x)i
2.2. THE NORMAL DISTRIBUTION
15
where x denotes the pixel coordinates, i.e. x = (i, j), is estimated by averaging over all of
the pixels in the image,
c,r
1 X
hG(x)i
g(i, j),
cr i,j=1
referred to as the sample mean vector. It is usually assumed to be independent of x, i.e.
hG(x)i = hGi.
The covariance between bands k and ` is defined according to
cov(Gk , G` ) = h(Gk hGk i)(G` hG` i)i
and is estimated again by averaging over the pixels:
cov(Gk , G` )
c,r
1 X
(gk (i, j) hGk i)(g` (i, j) hG` i),
cr i,j=1
which is called the sample covariance. The covariance is also usually assumed to be independent of x. The variance for bands k is given by

var(Gk ) = cov(Gk , Gk ) = (Gk hGk i)2 .
The random vector G is often assumed to be described by a multivariate normal probability density function p(g), given by

1
1
> 1
p
exp (g ) (g ) .
p(g) =
2
(2)N/2 ||
We indicate this by writing
G N (, ).
The distribution function of the multi-spectral pixels is then completely determined by the
expected value hGi = and by the covariance matrix . In two dimensions, for example,
2

cov(G1 , G2 )
1 12
var(G1 )
=
=
.
cov(G2 , G1 )
var(G2 )
21 22
Note that, since cov(Gk , G` ) = cov(G` , Gk ), the covariance matrix is symmetric, = > .
The covariance matrix can also be written as an outer product:
= h(G hGi)(G hGi)> i.
as can its estimated value:
c,r
1 X
(g(i, j) hGi)(g(i, j) hGi)> .
cr i,j=1
If hGi = 0, we can write simply
= hGG> i.
Another useful identity applies to any linear combination a> G of the random vector G,
namely
var(a> G) = a> a.
(2.2)
16
This is obvious in two dimensions, since we have

var(a> G) = cov(a1 G1 + a2 G2 , a1 G1 + a2 G2 )
= a21 var(G1 ) + a1 a2 cov(G1 , G2 ) + a1 a2 cov(G2 , G1 ) + a22 var(G2 )
.

var(G1 )
cov(G1 , G2 )
a1
= (a1 , a2 )
var(G2 )
a2
cov(G2 , G1 )
Variance is always nonnegative and the vector a in (2.2) is arbitrary, so we have
a> a 0 for all a.
The covariance matrix is therefore said to be positive semi-definite.
The correlation matrix C ispsimilar to the covariance matrix, except that each matrix
element (i, j) is normalized to var(Gi )var(Gj ). In two dimensions

cov(G1 ,G2 )
1
12
1 12
1
var(G1 )var(G2 )
1 2
C=
=
.
=
21
21 1
1
cov(G2 ,G1 )
1
1 2
var(G1 )var(G2 )
The following ENVI/IDL program calculates and prints out the covariance matrix of a
multispectral image:
envi_select, title=Choose multispectral image,fid=fid,dims=dims,pos=pos
if (fid eq -1) then return
num_cols = dims[2]-dims[1]+1
num_rows = dims[4]-dims[3]+1
num_pixels = (num_cols*num_rows)
num_bands = n_elements(pos)
samples=intarr(num_bands,n_elements(num_pixels))
for i=0,num_bands-1 do samples[i,*]=envi_get_data(fid=fid,dims=dims,pos=pos[i])
print, correlate(samples,/covariance,/double)
end
ENVI> .GO
111.46663
82.123236
159.58377
133.80637
82.123236
64.532431
124.84815
104.45298
205.63420
159.58377
124.84815
246.18004
133.80637
104.45298
205.63420
192.70367
2.3
A special function
If n is an integer, the factorial of n is defined by

n! = n(n 1) 1,
1! = 0! = 1.
The generalization of this to non-integers z is the gamma function

Z
(z) =
tz1 et dt.
0
It has the property

(z + 1) = z(z).
2.4. CONDITIONAL PROBABILITIES AND BAYES
THEOREM
17
The factorial is a special case, i.e. for integer n

(n) = n!
A further generalization is the incomplete gamma function
Z x
1
ta1 st dt.
P (a, x) =
(a) 0
It has the properties
P (a, 0) = 0,
P (a, ) = 1.
Here is a plot of P for a = 3 in IDL:

x=findgen(100)/10
envi_plot_dtat,x,igamma(3,x)
Figure 2.1: The incomplete gamma function.

We are interested in this function for the following reason. Suppose that the random
variables Xi , i = 1 . . . n, are independent normally distributed with zero mean and variance
i2 . Then the random variable
2
n
X
Xi
Z=
i
i=1
has the distribution function
P (z) = P r(Z z) = P (n/2, z/2),
and is said to be chi-square distributed with n degrees of freedom.
2.4
Conditional probabilities and Bayes

Theorem
If A and B are two events such that the probability of A andB occurring simultaneously is
P (A, B), then the conditional probability of A occuring given that B has occurred is
P (A | B) =
P (A, B)
.
P (B)
18
Bayes Theorem (named after Rev. Thomas Bayes, an 18th century mathematician who
derived a special case) is the basic starting point for inference problems using probability
theory as logic. We will use it in the following form. Let X be a random variable describing
a pixel intensity, and let {Ck | k = 1 . . . M } be a set of possible classes for the pixels. Then
the a posteriori conditional probability for class Ck , given the measured pixel intensity x is
P (Ck |x) =
P (x|Ck )P (Ck )
,
P (x)
(2.3)
where
P (Ck ) is the prior-probability for class Ck ,
P (x|Ck ) is the conditional probability of observing the value x, if it belongs to class Ck ,
PM
P (x) = k=1 p(x|Ck )p(Ck ) is the total probability for x.
2.5
Linear regression
Applying radiometric corrections to digital images often involves fitting a set of m data
points (xi , yi ) to a straight line:
y(x) = a + bx + .
Suppose that the measurements yi include a random error with variance 2 and that the
measurements xi are exact. Define a goodness of fit function
2
m
X
yi a bxi
2
(a, b) =
.
(2.4)
i=1
If the random variable is normally distributed, then we obtain the most likely (i.e. best)
values for a and b by minimizing this function, that is, by solving the equations
2
2
=
= 0.
a
b
The solution is
b = sxy ,
s2xx
where
a
= y b
x,
sxy =
1 X
(xi x
)(yi y)
m i=1
s2xx =
1 X
(xi x
)2
m i=1
(2.5)
1 X
xi ,
m i=1
m
x
=
1 X
yi .
m i=1
m
y =
The uncertainties in the estimates a

and b are given by
P
2 x2i
2
P
a = P 2
m xi ( xi )2
.
2 m
2
P
b = P 2
m xi ( xi )2
(2.6)
2.5. LINEAR REGRESSION
19
If 2 is not known a priori, then it can be estimated by

1 X
(yi a
bxi )2 .
m 2 i=1
m
2 =
Generalized and orthogonal least squares methods are described in Appendix A. A

recusive procedure is described in Appendix C.
20
Exercises
1. Write the multivariate normal probability density function p(g) for the case = 2 I.
Show that probability density function for a one-dimensional random variable G is a
special case. Prove that hGi = .
2. In the Monty Hall game a contestant is asked to choose between one of three doors.
Behind one of the doors is an automobile as prize for choosing the correct door. After
the contestant has chosen, Monty Hall opens one of the other two doors to show that
the automobile is not there. He then asks the contestant if she wishes to change her
mind and choose the other unopened door. Use Bayes theorem to prove that her
correct answer is yes.
3. Derive the uncertainty for a in (2.6) from the formula for error propagation
a2
N
X
i=1
f
yi
2
.
Chapter 3
Transformations
Up until now we have thought of multispectral images as (r c N )-dimensional arrays
of measured pixel intensities. In the present chapter we consider other representations of
images which are often useful in image analysis.
3.1
Fourier transforms
Figure 3.1: Fourier series approximation of a sawtooth function. The series was truncated
at k = 4. The left hand side shows the intensities |
x(k)|2 .
A periodic function x(t) with period T ,
x(t) = x(t + T )
can always be expressed as the infinite Fourier series
x(t) =
x
(k)ei2(kf )t ,
(3.1)
k=
where f = 1/T = /2 and eix = cos x + i sin x. From the orthogonality of the e-functions,
the coefficients x
(k) in the expansion are given by
Z 1/2f
x
(k) = f
x(t)ei2(kf )t dt.
(3.2)
1/2f
21
22
CHAPTER 3. TRANSFORMATIONS
Figure 3.1 shows an example for the sawtooth function with period T = 1:
x(t) = t, 1/2 t < 1/2.
Parsevals formula follows directly from (3.2)
Z
X
|
x(k)|2 = f
k
3.1.1
1/2f
(x(t))2 dt.
1/2f
Discrete Fourier transform
Let g(j) be a discrete sample of the real function g(x) (a row of pixels), sampled c times at
the sampling interval over a complete period T , i.e.
g(j) = g(x = j),
j = 0 . . . c 1.
The corresponding discrete Fourier series is written

g(j) =
c/2
1 X
g(k)ei2(kf )(j) , j = 0 . . . c 1,
c
(3.3)
k=c/2
where the truncation frequency 2c f is the highest frequency component that can be determined by the sampling. This frequency is called the Nyquist critical frequency and is given
by 1/2, so that f is determined by
cf
1
=
2
2
or
f=
1
.
c
(This corresponds to sampling over one complete period: c = T .) Thus (3.3) becomes
c/2
1 X
g(k)ei2kj/c ,
g(j) =
c
j = 0 . . . c 1.
k=c/2
With the observation

ei2(c/2)j/c = eij = (1)c = eij = ei2(c/2)j/c ,
we can write this as
g(j) =
c/21
1 X
g(k)ei2kj/c ,
c
j = 0 . . . c 1,
k=c/2
a set of c equations in the c unknown frequency components g(k). Equivalently,

g(j) =
c/21
1
1 X
1 X
g(k)e2kj/c +
g(k)ei2kj/c
c
c
k=0
k=c/2
c/21
c1
0
1 X
1 X
g(k)ei2kj/c +
X(k 0 c)ei2(k c)j/c
c
c 0
c1
1 X
1 X
g(k)ei2kj/c +
g(k c)ei2kj/c .
c
c
k=0
k =c/2
c/21
k=0
k=c/2
3.2. WAVELETS
23
Thus we can write
1X
g(k)ei2kj/c ,
c
c1
g(j) =
j = 0 . . . c 1,
(3.4)
k=0
if we interpret g(k) g(k c) when k c/2.

The solution to (3.4) for the complex frequency components g(k) is called the discrete
Fourier transform and is given by
g(k) =
c1
X
g(j)ei2kj/c ,
k = 0 . . . c 1.
(3.5)
j=0
This follows from the following orthogonality property:

c1
X
ei2(kk )j/c = ck,k0 .
(3.6)
j=0
Eq. (3.4) itself is the discrete inverse Fourier transform. The discrete analog of Parsivals
formula is
c1
c1
X
1X
|
g (k)|2 =
g(j)2 .
(3.7)
c j=0
k=0
Determining the frequency components in (3.5) would appear to involve, in all, c2 floating
point multiplication operations. The fast Fourier transform (FFT) exploits the structure of
the complex e-functions to reduce this to order c log c, see for example [PFTV86].
3.1.2
Discrete Fourier transform of an image
The discrete Fourier transform is easily generalized to two dimensions for the purpose of
image analysis. Let g(i, j), i, j = 0 . . . c 1, represent a (quadratic) gray scale image. Its
discrete Fourier transform is
g(k, `) =
c1 X
c1
X
g(i, j)ei2(ik+j`)/c
(3.8)
i=0 j=0
and the corresponding inverse transform is

g(i, j) =
c1 c1
1 XX
g(k, `)ei2(ik+j`)/c .
c2
(3.9)
k=0 `=0
3.2
Wavelets
Unlike the Fourier transform, which represents a signal (array of pixel intensities) in terms
of pure frequency functions, the wavelet transform expresses the signal in terms of functions
which are restricted both in terms of frequency and spatial extent. In many applications,
this turns out to be particularly efficient and useful. Well see an example of this in Chapter
7, where we discuss image fusion in more detail. The wavelet transform is discussed in
Appendix B.
24
3.3
Principal components
The principal components transformation forms linear combinations of multispectral pixel

intensities which are mutually uncorrelated and which have maximum variance.
We assume without loss of generality that hGi = 0, so that the covariance matrix of a
multispectral image is is = hGG> i, and look for a linear combination Y = a> G with
maximum variance, subject to the normalization condition a> a = 1. Since the covariance
of Y is a> a, this is equivalent to maximizing an unconstrained Lagrange function, see
Section 1.4,
L = a> a 2(a> a 1).
The maximum of L occurs at that value of a for which L
a = 0. Recalling the rules for vector
differentiation,
L
= 2a 2a = 0
a
which is the eigenvalue problem
a = a.
Since is real and symmetric, the eigenvectors are orthogonal (and normalized). Denote
them a1 . . . aN for eigenvalues 1 . . . N . Define the matrix
A = (a1 . . . aN ),
AA> = I,
and let the the transformed principal component vector be Y = A> G with covariance matrix
0 . Then we have
0 = hYY> i = hA> GG> Ai
1
0
= A> A = Diag(1 . . . N ) =
...
0
2
..
.
..
.
0
0
=: .
..
.
N
The fraction of the total variance in the original multispectral image which is described by
the first i principal components is
1 + . . . + i
.
1 + . . . + i + . . . + N
If the original multispectral channels are highly correlated, as is usually the case, the first
few principal components will account for a very high percentage of the variance the image.
For example, a color composite of the first 3 principal components of a LANDSAT TM
scene displays essentially all of the information contained in the 6 spectral components in
one single image. Nevertheless, because of the approximation involved in the assumption
of a normal distribution, higher order principal components may also contain significant
information [JRR99].
The principal components transformation can be performed directly from the ENVI main
menu. However the following IDL program illustrates the procedure in detail:
; Principal components analysis
envi_select, title=Choose multispectral image, $
3.4. MINIMUM NOISE FRACTION
25
fid=fid, dims=dims,pos=pos
if (fid eq -1) then return
num_cols = dims[2]+1
num_lines = dims[4]+1
num_pixels = (num_cols*num_lines)
num_channels = n_elements(pos)
image=intarr(num_channels,num_pixels)
for i=0,num_channels-1 do begin
temp=envi_get_data(fid=fid,dims=dims,pos=pos[i])
m = mean(temp)
image[i,*]=temp-m
endfor
; calculate the transformation matrix A
sigma = correlate(image,/covariance,/double)
lambda = eigenql(sigma,eigenvectors=A,/double)
print,Covariance matrix
print, sigma
print,Eigenvalues
print, lambda
print,Eigenvectors
print, A
; transform the image
image = image##transpose(A)
; reform to BSQ format
PC_array = bytarr(num_cols,num_lines,num_channels)
for i = 0,num_channels-1 do PC_array[*,*,i] = $
reform(image[i,*],num_cols,num_lines,/overwrite)
; output the result to memory
envi_enter_data, PC_array
end
3.4
Minimum noise fraction
Principal components analysis maximizes variance. This doesnt always lead to images of
decreasing image quality (i.e. of increasing noise). The MNF transformation minimizes the
noise content rather than maximizing variance, so, if this is the desired criterion, it is to be
preferred over PCA.
Suppose we can represent a gray scale image G with covariance matrix and zero mean
as a sum of uncorrelated signal and noise noise components
G = S + N,
26
both normally distributed, with covariance matrices S and N and zero mean. Then we
have
= hGG> i = h(S + N)(S + N)> i = hSS> i + hNN> i,
since noise and signal are uncorrelated, i.e. hSN> i = hNS> i = 0. Thus
= S + N .
(3.10)
Now let us seek a linear combination a> G for which the signal to noise ratio
SNR =
var(a> S)
a > S a
= >
>
var(a N)
a N a
is maximized. From (3.10) we can write this in the form

SN R =
a> a
1.
a> N a
(3.11)
Differentiating we get
1
1
a> a 1
SNR = >
a >
N a = 0,
a
a N a 2
(a N a)2 2
or, equivalently,
(a> N a)a = (a> a)N a .
This condition is met when a solves the generalized eigenvalue problem

N a = a.
(3.12)
Both N and are symmetric and the latter is also positive definite. Its Cholesky factorization is
= LL> ,
where L is a lower triangular matrix, and can be thought of as the square root of . Such
an L always exists is is positive definite. With this, we can write (3.12) as
N a = LL> a
or, equivalently,
L1 N (L> )1 L> a = L> a
or, with b = L> a and commutivity of inverse and transpose,

[L1 N (L1 )> ]b = b,
a standard eigenproblem for a real, symmetric matrix L1 N (L1 )> .
From (3.11) we see that the SNR for eigenvalue i is just
SNRi =
a>
i ai
>
ai (i ai )
1=
1
1.
i
Thus the eigenvector ai corresponding to the smallest eigenvalue i will maximize the signal
to noise ratio. Note that (3.12) can be written in the form
N A = A,
(3.13)
3.4. MINIMUM NOISE FRACTION
27
where A = (a1 . . . aN ) and = Diag(1 . . . N ).

The MNF transformation is available in the ENVI environment. It is carried out in
two steps which are equivalent to the above. First of all the noise contribution to G is
whitened, i.e. the random vector N has covariance matrix I, the identity matrix. Since
N can be assumed to be diagonal anyway (the noise in any band is uncorrelated with the
noise in any other band), we accomplish this by doing a transformation which divides the
components of G by the standard deviations of the noise,
1/2
X = N
where
1/2
G,
1/2
N N
= I.
The transformed random vector X thus has covariance matrix

1/2
X = N
1/2
(3.14)
Next we do an ordinary principal components transformation on X, i.e.

Y = B> X
where
B> X B = X ,
B> B = I.
(3.15)
The overall transformation is thus

1/2
Y = B> N
G = A> G
1/2
where A = N B is not an orthogonal transformation. To see that this transformation is

equivalent to solving the generalized eigenvalue problem, consider
1/2
N A = N N
=
=
=
1/2
N X B1
X
1/2 1/2
1/2
N N N B1
X
A1
.
X
This is equivalent to (3.13) with

Xi =
1
= SNRi + 1.
i
Thus an eigenvalue in the second transformation equal to one corresponds to pure noise.
Before the transformation can be performed, it is of course necessary to estimate the
noise covariance matrix N . This can be done for example by differencing with respect to
the local mean:
(N )k`
c,r
1 X
(gk (i, j) mk (i, j))(g` (i, j) m` (i, j))
cr i,j
where mk (i, j) is the local mean of pixels in some neighborhood of (i, j).
28
3.5
Maximum autocorrelation factor (MAF)
Let x represent the coordinates of a pixel within image G, i.e. x = (i, j). We consider the
covariance matrix between the original image, represented by G(x), and the same image
G(x + ) shifted by an amount = (x , y )> :
() = hG(x)G(x + )> i,
assumed to be independent of x. Then
(0) = ,
and furthermore
() = hG(x)G(x )> i
= hG(x + )G(x)> i
= h(G(x)G(x + )> )> i
= ()> .
Now we consider the covariance of projections of the original and shifted images:
cov(a> G(x), a> G(x + )) = a> hG(x)G(x + )> ia
= a> ()a
= a> ()a
1
= a> (() + ())a.
2
(3.16)
Define as the covariance matrix of the difference image G(x) G(x + ), i.e.
= h(G(x) G(x + ))(G(x) G(x + )> i
= hG(x)G(x)> i + hG(x + )G(x + )> i hG(x)G(x + )> i
hG(x + )G(x)> i
= 2 () ().
Hence () + () = 2 and we can write (3.16) in the form
1
cov(a> G(x), a> G(x + )) = a> a a> a.
2
The correlation of the projections is therefore given by
a> a 12 a> a
corr(a> G(x), a> G(x + )) = p
var(a> G(x))var(a> G(x + ))
a> a 12 a> a
= p
(a> a)(a> a)
=1
(3.17)
1 a> a
.
2 a> a
We want to determine that vector a which extremalizes this correlation, so we wish to

extremalize the function
a> a
.
R(a) = >
a a
3.5. MAXIMUM AUTOCORRELATION FACTOR (MAF)

Differentiating,
or
29
1
1
R
a> a 1
= >
a >
a = 0
a
a a 2
(a a)2 2
(a> a) a = (a> a)a.
This condition is met when a solves the generalized eigenvalue problem

a = a,
(3.18)
which is seen to have the same form as (3.12). Again both and are symmetric and
the latter is also positive definite and we obtain the standard eigenproblem
[L1 (L1 )> ]b = b,
for the real, symmetric matrix L1 (L1 )> .
Let the eigenvalues be 1 . . . N and the corresponding (orthogonal) eigenvectors be
bi . We have
>
>
>
i 6= j,
0 = b>
i bj = ai LL aj = ai aj ,
and therefore
>
>
cov(a>
i G(x), aj G(x)) = ai aj = 0,
i 6= j,
so that the MAF components are orthogonal (uncorrelated). Moreover with equation (2.14)
we have
1
>
corr(a>
i G(x), ai G(x + )) = 1 i ,
2
and the first MAF component has minimum autocorrelation.
An ENVI plug-in for performing the MAF transformation is given in Appendix D.5.2.
30
Exercises
1. Show that, for x(t) = sin(2t) in Eq. (2.2),
x
(1) =
1
,
2i
x
(1) =
1
,
2i
and x
(k) = 0 otherwise.
2. Calculate the discrete Fourier transform of the sequence 2, 4, 6, 8 from (3.4). You have
to solve four simultaneous equations, the first of which is
2=

1
g(0) + g(1) + g(2) + g(3) .
4
Verify your result in IDL with the command

print, FFT([2,4,6,8])
Chapter 4
Radiometric enhancement
4.1
Lookup tables
Figure 4.1: Contrast enhancement with a lookup table represented as the continuous function
f (x) [JRR99].
Intensity enhancement of an image is easily accomplished by means of lookup tables. For
byte-encoded data, the pixel intensities g are used to index an array
LU T [k],
k = 0 . . . 255,
the entries of which also lie between 0 and 255. These entries can be chosen to implement
linear stretching, saturation, histogram equalization, etc. according to
gk (i, j) = LU T [gk (i, j)],
0 i r 1, 0 j c 1.
31
32
CHAPTER 4. RADIOMETRIC ENHANCEMENT
It is also useful to think of the the lookup table as an approximately continuous function
y = f (x).
If hin (x) is the histogram of the original image and hout (y) is the histogram of the image
after transformation through the lookup table, then, since the number of pixels is constant,
hout (y) dy = hin (x) dx,
see Fig.4.1
4.1.1
Histogram equalization
For histogram equalization, we want hout (y) to be constant independent of y. Hence

dy hin (x) dx
Z
and
y = f (x)
hin (t)dt.
0
The lookup table y for histogram equalization is thus proportional to the cumulative sum
of the original histogram.
4.1.2
Histogram matching
Figure 4.2: Steps required for histogram matching [JRR99].

It is often desirable to match the histogram of one image to that of another so as to
make their apparent brightnesses as similar as possible, for example when the two images
4.2. CONVOLUTIONS
33
are combined in a mosaic. We can do this by first equalizing both the input histogram
hin (x) and the reference histogram href (y) with the cumulative lookup tables z = f (x) and
z = g(y), respectively. The required lookup table is then
y = g 1 (z) = g 1 (f (x)).
The necessary steps for implementing this function are illustrated in Fig. 1.5 taken from
[JRR99].
4.2
Convolutions
With the convention

= 2k/c
we can write (3.5) in the form
g() =
c1
X
g(j)eij .
(4.1)
j=0
The convolution of g with a filter h = (h(0), h(1), . . .) is defined by

X
f (j) =
h(k)g(j k) =: h g,
(4.2)
where the sum is over all nonzero elements of the filter h. If the number of nonzero elements
is finite, we speak of a finite impulse response filter (FIR).
Theorem 1 (Convolution theorem) In the frequency domain, convolution is replaced by
multiplication: f() = h()

g ().
Proof:
f() =
f (j)eij =
h()
g () =
j,k
X
k
ik
h(k)e
h(k)g(j k)eij
!
i`
g(`)e
h(k)g(`)ei(k+`)
k,`
h(k)g(j k)eij = f().
k,j
This can of course be generalized to two dimensional images, so that there are three
basic steps involved in image filtering:
1. The image and the convolution filter are transformed from the spatial domain to the
frequency domain using the FFT.
2. The transformed image is multiplied with the frequency filter.
3. The filtered image is transformed back to the spatial domain.
34
We often distinguish between low-pass and high-pass filters. Low pass filters perform
some sort of averaging. The simplest example is
h = (1/2, 1/2, 0 . . .),
which computes the average of two consecutive pixels. A high-pass filter computes differences
of nearby pixels, e.g.
h = (1/2, 1/2, 0 . . .).
Figure 4.3 shows the Fourier transforms of these two simple filters generated by the the IDL
program
; Hi-Lo pass filters
x = fltarr(64)
x[0]=0.5
x[1]=-0.5
p1 =abs(FFT(x))
x[1]=0.5
p2 =abs(FFT(x))
envi_plot_data,lindgen(64),[[p1],[p2]]
end
Figure 4.3: Low-pass(red) and high-pass (white) filters in the frequency domain. The quan2
tity |h(k)|
is plotted as a function of k. The highest frequency is at the center of the plots,
k = c/2 = 32 .
4.2.1
Laplacian of Gaussian filter
We shall illustrate image filtering with the so-called Laplacian of Gaussian (LoG) filter,
which will be used in Chapter 6 to implement contour matching for automatic determination
of ground control points. To begin with, consider the gradient operator for a two-dimensional
image:
=
=i
+j
,
x
x1
x2
4.2. CONVOLUTIONS
35
where i and j are unit vectors in the vertical and horizontal directions, respectively. g(x)
is a vector in the direction of the maximum rate of change of gray scale intensity. Since the
intensity values are discrete, the partial derivatives must be approximated. For example we
can use the Sobel operators:
g(x)
[g(i 1, j 1) + 2g(i, j 1) + g(i + 1, j 1)]
x1
[g(i 1, j + 1) + 2g(i, j + 1) + g(i + 1, j + 1)] = 2 (i, j)
g(x)
[g(i 1, j 1) + 2g(i 1, j) + g(i 1, j + 1)]
x2
[g(i + 1, j 1) + 2g(i + 1, j) + g(i + 1, j + 1)] = 1 (i, j)
which are equivalent to the two-dimensional FIR filters
1
h1 = 2
1
0
0
0
1
2
1
1
and h2 = 0
1
2
0
2
1
0 ,
1
respectively. The magnitude of the gradient is

q
|| = 21 + 22 .
Edge detection can be achieved by calculating the filtered image
f (i, j) = ||(i, j)
and setting an appropriate threshold.
Figure 4.4: Laplacian of Gaussian filter.
36
Now consider the second derivatives of the image intensities, which can be represented
formally by the Laplacian
2
2
+ 2.
2 = > =
2
x1
x2
2 g(x) is a scalar quantity which is zero whenever the gradient is maximum. Therefore
changes in intensity from dark to light or vice versa correspond to sign changes in the
Laplacian and these can also be used for edge detection. The Laplacian can also be approximated by a FIR filter, however such filters tend to be very sensitive to image noise.
Usually a low-pass Gauss filter is first used to smooth the image before the Laplacian filter
is applied. It is more efficient, however, to calculate the Laplacian of the Gauss function
itself and then use the resulting function to derive a high-pass filter. The Gauss function in
two dimensions is given by
1
1
exp 2 (x21 + x22 ),
2
2
2
where the parameter determines its extent. Its Laplacian is

1
1
2
2
2
2
2
(x
+
x
2
)
exp
(x
+
x
)
2
2
2 6 1
2 2 1
a plot of which is shown in Fig. 4.4.
The following program illustrates the application of the filter to a gray scale image, see
Fig. 4.5:
pro LoG
sigma = 2.0
filter = fltarr(17,17)
for i=0L,16 do for j=0L,16 do $
filter[i,j] = (1/(2*!pi*sigma^6))*((i-8)^2+(j-8)^2-2*sigma^2) $
*exp(-((i-8)^2+(j-8)^2)/(2*sigma^2))
; output as EPS file
thisDevice =!D.Name
set_plot, PS
Device, Filename=c:\temp\LoG.eps,xsize=4,ysize=4,/inches,/Encapsulated
shade_surf,filter
device,/close_file
set_plot, thisDevice
; read a jpg image
filename = Dialog_Pickfile(Filter=*.jpg,/Read)
OK = Query_JPEG(filename,fileinfo)
if not OK then return
xsize = fileinfo.dimensions[0]
ysize = fileinfo.dimensions[1]
window,11,xsize=xsize,ysize=ysize
Read_JPEG,filename,image1
image = bytarr(xsize,ysize)
4.2. CONVOLUTIONS
37
image[*,*] = image1[0,*,*]
tvscl,image
; run the filter
filt = image*0.0
filt[0:16,0:16]=filter[*,*]
image1= float(fft(fft(image)*fft(filt),1))
; get zero-crossings and display
image2 = bytarr(xsize,ysize)
indices = where( (image1*shift(image1,1,0) lt 0) or (image1*shift(image1,0,1) lt 0) )
image2[indices]=255
wset, 11
tv, image2
end
Figure 4.5: Image filtered with the Laplacian of Gaussian filter.
38
Chapter 5
Topographic modelling
Satellite images are two-dimensional representations of the three-dimensional earth surface.
The correct treatment of the third dimension the elevation is essential for terrain modelling and accurate georeferencing.
5.1
RST transformation
Transformations of spatial coordinates1 in 3 dimensions which involve only rotations, scaling

and translations can be represented by a 4 4 transformation matrix A
v = Av
(5.1)
where v is the column vector containing the original coordinates

v = (X, Y, Z, 1)>
and v contains the transformed coordinates
v = (X , Y , Z , 1)> .
For example the translation
X = X + X0
Y = Y + Y0
Z = Z + Z0
corresponds to the transformation matrix
1
0
T=
0
0
a uniform scaling by 50% to
1/2
0
S=
0
0
1 The
0 0
1 0
0 1
0 0
0
1/2
0
0
X0
Y0
,
Z0
1
0
0
1/2
0
0
0
,
0
1
following treatment closely follows Chapter 2 of Gonzales and Woods [GW02].
39
40
CHAPTER 5. TOPOGRAPHIC MODELLING
and a simple rotation about the Z-axis to
cos
sin
R =
0
0
sin
cos
0
0
0 0
0 0
,
1 0
0 1
etc. The complete RST transformation is then

v = RSTv = Av.
(5.2)
The inverse transformation is of course represented by A1 .
5.2
Imaging transformations
An imaging (or perspective) transformation projects 3D points onto a plane. It is used to

describe the formation of a camera image and, unlike the RST transformation, is non-linear
since it involves division by coordinate values.
Figure 5.1: Basic imaging process, from [GW02].

In Figure 5.1, the camera coordinate system (x, y, x) is aligned with the world coordinate
system, describing the terrain to be imaged. The camera focal length is . From simple geometry we obtain expressions for the image plane coordinates in terms of the world
coordinates:
X
x=
Z
(5.3)
Y
y=
.
Z
Solving for the X and Y world coordinates:
x
( Z)
y
Y = ( Z).
X=
(5.4)
5.3. CAMERA MODELS AND RFM APPROXIMATIONS
41
Thus, in order to extract the geographical coordinates (X, Y ) of a point on the earths
surface from its image coordinates, we require knowledge of the elevation Z. Correcting for
the elevation in this way constitutes the process of orthorectification.
5.3
Camera models and RFM approximations
Equation (5.3) is overly simplified, as it assumes that the origin of world and image coordinates coincide. In order to apply it, one has first to transform the image coordinate system
from the satellite to the world coordinate system. This is done in a straightforward way
with the rotation and translation transformations introduced in Section 5.1. However it
requires accurate knowledge of the height and orientation of the satellite imaging system at
the time of the image acquisition (or, more exactly, during the acquisition, since the latter
is normally not instantaneous). The resulting non-linear equations that relate image and
world coordinates are what constitute the camera or sensor model for that particular image.
Direct use of the camera model for image processing is complicated as it requires extremely exact, sometimes proprietary information about the sensor system and its orbit.
An alternative exists if the image provider also supplies a so-called rational function model
(RFM) which approximates the camera model for each acquisition as a ratio of rational
polynomials, see e.g. [TH01]. Such RFMs have the form
a(X 0 , Y 0 , Z 0 )
b(X 0 , Y 0 , Z 0 )
c(X 0 , Y 0 , Z 0 )
c0 = g(X 0 , Y 0 , Z 0 ) =
d(X 0 , Y 0 , Z 0 )
r0 = f (X 0 , Y 0 , Z 0 ) =
(5.5)
where c0 and r0 are the column and row (XY) coordinates in the image plane relative to an
origin (c0 , r0 ) and scaled by a factor cs resp. rs :
c0 =
c c0
,
cs
r0 =
r r0
.
rs
Similarly X 0 , Y 0 and Z 0 are relative, scaled world coordinates:

X0 =
X X0
,
Xs
Y0 =
Y Y0
,
Ys
Z0 =
Z Z0
.
Zs
The polynomials a, b, c and d are typically to third order in the world coordinates, e.g.
a(X, Y, Z) = a0 + a1 X + a2 Y + a3 Z + a4 XY + a5 XZ + a6 Y Z + a7 X 2 + a8 Y 2 + a9 Z 2
+ a10 XY Z + a11 X 3 + a12 XY 2 + a13 XZ 2 + a14 X 2 Y + a15 Y 3 + a16 Y Z 2
+ a17 X 2 Z + a18 Y 2 Z + a19 Z 3
The advantage of using ratios of polynomials is that these are less subject to interpolation
error.
For a given acquisition the provider fits the RFM to his camera model using a threedimensional grid of points covering the image and world spaces with a least squares fitting
procedure. The RFM is capable of representing the camera model extremely well and can
be used as a replacement for it. Both Space Imaging and Digital Globe provide RFMs with
their high resolution IKONOS and QuickBird imagery. Below is a sample Quickbird RFM
file giving the origins, scaling factors and polynomial coefficients needed in Eq. (5.5).
42
satId = "QB02";
bandId = "P";
SpecId = "RPC00B";
BEGIN_GROUP = IMAGE
errBias =
56.01;
errRand =
0.12;
lineOffset = 4683;
sampOffset = 4154;
latOffset =
32.5709;
51.8391;
longOffset =
heightOffset = 1582;
lineScale = 4733;
sampScale = 4399;
latScale =
0.0256;
longScale =
0.0269;
heightScale = 500;
lineNumCoef = (
+1.162844E-03,
-7.011681E-03,
-9.993482E-01,
-1.119999E-02,
-6.682911E-06,
+7.591306E-05,
+3.632740E-04,
-1.111298E-04,
-5.842086E-04,
+2.212466E-06,
-1.275349E-06,
+1.279061E-06,
+1.918762E-08,
-6.957548E-07,
-1.240783E-06,
-7.644403E-07,
+3.479752E-07,
+1.259300E-05,
+1.085128E-06,
-1.571375E-06);
lineDenCoef = (
+1.000000E+00,
+1.801541E-06,
+5.822024E-04,
+3.774278E-04,
-2.141015E-08,
-6.984359E-07,
-1.344888E-06,
-9.669251E-07,
-4.726988E-08,
+1.329814E-06,
+2.113403E-08,
-2.914653E-06,
5.3. CAMERA MODELS AND RFM APPROXIMATIONS

-4.367422E-07,
+6.988065E-07,
+4.035593E-07,
+3.275453E-07,
-2.740827E-07,
-4.147675E-06,
-1.074015E-06,
+2.218804E-06);
sampNumCoef = (
-9.783496E-04,
+9.566915E-01,
-8.477919E-03,
-5.393803E-02,
-1.590864E-04,
+5.477412E-04,
-3.968308E-04,
+4.819512E-04,
-3.965558E-06,
-3.442885E-05,
+5.821180E-08,
+2.952683E-08,
-1.363146E-07,
+2.454422E-07,
+1.372698E-07,
+1.987710E-07,
-3.167074E-07,
-1.038018E-06,
+1.376092E-07,
-2.352636E-07);
sampDenCoef = (
+1.000000E+00,
+5.029785E-04,
+1.225257E-04,
-5.780883E-04,
-1.543054E-07,
+1.240426E-06,
-1.830526E-07,
+3.264812E-07,
-1.255831E-08,
-5.177631E-07,
-5.868514E-07,
-9.029287E-07,
+7.692317E-08,
+1.289335E-07,
-3.649242E-07,
+0.000000E+00,
+1.229000E-07,
-1.290467E-05,
+4.318970E-08,
-8.391348E-08);
43
44
END_GROUP = IMAGE
END;
To illustrate a simple use of the RFM data, consider a vertical structure in a highresolution image, such as a chimney or building fassade. Suppose we determine the image
coordinates of the bottom and top of the structure to be (rb , cb ) and (rt , ct ), respectively.
Then from 5.5
rb = f (X, Y, Zb )
cb = g(X, Y, Zb )
rt = f (X, Y, Zt )
(5.6)
ct = g(X, Y, Zt ),
since the (X, Y ) coordinates must be the same. This would appear to constitute a set of
four equations in four unknowns X, Y , Zb and Zt , however the solution is unstable because
of the close similarity of Zt to Zb . Nevertheless the object height Zt Zb can be obtained
by the following procedure:
1. Get (rb , cb ) and (rt , ct ) from the image.
2. Solve first two equations in (5.6) (e.g. with Newtons method) for X and Y with Zb
set equal to the average elevation in the scene if no DEM is available, otherwise to the
true elevation.
3. For a spanning range of Zt0 values, calculate (rt0 , c0t ) from the second two equations in
(5.6) and choose for Zt the value of Zt0 which gives closest agreement to the values
read in.
Quite generally, the RFM can approximate the camera model very well and can be used
as an alternative for providing end users with the necessary information to perform their
own photogrammetric processing. An ENVI plug-in for object height determination
from RFM data is given in Appendix D.2.1.
5.4
Stereo imaging, elevation models and

orthorectification
The missing elevation information Z in (5.3) or in (5.5) can be obtained with stereoscopic
imaging techniques. Figure 5.2 shows two cameras viewing the same world point w from
two positions. The separation of the lens centers is the baseline. The objective is to find
the coordinates (X, Y, Z) of w if its image points have coordinates (x1 , y1 ) and (x2 , y2 ). We
assume that the cameras are identical and that their image coordinate systems are perfectly
aligned, differing only in the location of their origins. The Z coordinate of w is the same for
both coordinate systems.
In Figure 5.3 the first camera is brought into coincidence with the world coordinate
system. Then from (5.4),
x1
X1 =
( Z).
Alternatively, if the second camera is brought to the origin of the world coordinate system,
x2
X2 =
( Z).
5.4. STEREO IMAGING, ELEVATION MODELS AND ORTHORECTIFICATION 45
Figure 5.2: The stereo imaging process, from [GW02].
Figure 5.3: Top view of Figure 5.2, from [GW02].
46
But, from the figures,

X2 = X1 + B,
where B is the baseline. We have from the above three equations:
Z =
B
.
x2 x1
(5.7)
Thus if the displacement of the image coordinates of the point w, namely x2 x1 can be
determined, the Z coordinate can be calculated. The task is then to find two corresponding points in different images of the same scene. This is usually accomplished by spatial
correlation techniques and is closely related to the problem of image-to-image registration
discussed in the next chapter.
Figure 5.4: ASTER stereo acquisition geometry.

Because the stereo image must be correlated, best results are obtained if they are acquired
within a very short time of each other, preferably along track if a single platform is used,
see Figure 5.4. This figure shows the orientation and imaging geometry of the VNIR 3N and
3B cameras on the ASTER platform for acquiring a stereo full scene. The satellite travels at

a speed of 6.7 km/sec at a height of 705 km. A 60 60 km2 full scene is scanned in 9 seconds.
55 seconds later the same scene is scanned by the back-looking camera, corresponding to a
baseline of 370 km. The along-track geometry means that the stereo pair is unipolar, that
is, the displacements due to viewing angle are only along the y axis in the imaging plane.
Therefore the spatial correlation algorithm used to match points can be one dimensional. If
carried out on a pixel for pixel basis, one obtains a digital elevation model (DEM).
Figure 5.5: ASTER 3N nadir camera image.
Figure 5.6: ASTER 3B back-looking camera image.

As an example, Figures 5.5 and 5.6 show an ASTER stereo pair. Both images have been
rotated so as to make them unipolar.
48

The following IDL program calculates a very rudimentary DEM:
pro test_correl_images
height = 705.0
base = 370.0
pixel_size = 15.0
envi_select, title=Choose 1st image, fid=fid1, dims=dims1, pos=pos1, /band_only
envi_select, title=Choose 2nd image, fid=fid2, dims=dims2, pos=pos2, /band_only
im1 = envi_get_data(fid=fid1,dims=dims1,pos=pos1)
im2 = envi_get_data(fid=fid2,dims=dims2,pos=pos2)
n_cols = dims1[2]-dims1[1]+1
n_rows = dims1[4]-dims1[3]+1
parallax = fltarr(n_cols,n_rows)
progressbar = Obj_New(progressbar, Color=blue, Text=0,$
title=Cross correlation, column ...,xsize=250,ysize=20)
progressbar->start
for i=7L,n_cols-8 do begin
if progressbar->CheckCancel() then begin
envi_enter_data,pixel_size*parallax*(height/base)
progressbar->Destroy
return
endif
progressbar->Update,(i*100)/n_cols,text=strtrim(i,2)
for j=25L,n_rows-26 do begin
cim = correl_images(im1[i-5:i+5,j-5:j+5],im2[i-7:i+7,j-25:j+25], $
xoffset_b=0,yoffset_b=-20,xshift=0,yshift=20)
corrmat_analyze,cim,xoff,yoff,m,e,p
parallax[i,j] = yoff > (-5.0)
endfor
endfor
progressbar->destroy
envi_enter_data,pixel_size*parallax*(height/base)
end
This program makes use of the routines correl images and corrmat analyze from the IDL
Astronomy Users Library2 to calculate the cross-correlation of the two images. For each
pixel in the nadir image an 11 11 window is moved along an 11 51 window in the backlooking image centered at the same position. The point of maximum correlation defines the
parallax or displacement p. This is related to the relative elevation e of the pixel according
to
h
e = p 15m,
b
where h is the height of the sensor and b is the baseline, see Figure 5.7.
Figure 5.8 shows the result. Clearly there are many problems due to the correlation
errors, however the relative elevations are approximately correct when compared to the
DEM determined with the ENVI commercial add-on AsterDTM, see Figure 5.9.
2 www.astro.washington.edu/deutsch/idl/htmlhelp/index.html
back camera
nadir camera
satellite motion
e
p
ground
Figure 5.7: Relating parallax p to elevation e by similar triangles: e/p = (h e)/b h/b.
Figure 5.8: A rudimentary DEM.
50
Figure 5.9: DEM generated with the commercial product AsterTDM.

Either the complete camera model or an RFM can be used, but usually neither is sufficient
for an absolute DEM relative to mean sea level. Most often additional ground reference
points within the image whose elevations are known are also required for absolute calibration.
The orthorectification of the image is carried out on the basis of a suitable DEM and
consists of projecting the (X, Y, Z) coordinates of each pixel onto the (X, Y ) coordinates of
a given map projection.
5.5
Slope and aspect
Terrain analysis involves the processing of elevation data. Specifically we consider here
the generation of slope images, which give the steepness of the terrain at each pixel, and
aspect images, which give the prevailing direction relative to north of a vector normal to the
landscape at each pixel.
A 33 pixel window can be used to determine both slope and aspect, see Figure 5.10.
Define
x1 = c a y1 = a g
x2 = f d
y2 = b h
x3 = i g
y3 = c i
and
x = (x1 + x2 + x3 )/(3xs )
y = (y1 + y2 + y3 )/(3xs ,
where xs , ys give the pixel dimensions in meters. Then the slope in % at the central pixel
position is given by
p
(x)2 + (y )2
s=
100
2
whereas the aspect in radians measured clockwise from north is

x
= tan1
.
y
5.6. ILLUMINATION CORRECTION
51
Figure 5.10: Pixel elevations in an 8-neighborhood. The letters represent elevations.
Slope/aspect determinations from a DEM are available in the ENVI main menu under
Topographic/Topographic Modelling.
5.6
Illumination correction
Figure 5.11: Angles involved in computation of local solar elevation, taken from [RCSA03].
Topographic modelling can be used to correct images for the effects of local solar illumination, which depends not only upon the suns position (elevation and azimuth) but also
upon the local slope and aspect of the terrain being illuminated. Figure 5.11 shows the
angles involved [RCSA03]. Solar elevation is i , solar azimuth is a , p is the slope and 0
is the aspect. The quantity to be calculated is the local solar elevation i which determines
52
the local irradiance. From trigonometry we have

cos i = cos p cos i + sin p sin i cos(a 0 ).
(5.8)
An example of a cos i image in hilly terrain is shown in Figure 5.12.
Figure 5.12: Cosine of local solar illumination angle stretched across a DEM.
Let T represent the reflectance of the inclined surface in Figure 5.11. Then for a
Lambertian surface, i.e. a surface which scatters reflected radiation uniformly in al directions,
the reflectance of the corresponding horizontal surface H would be
H = T
cos i
.
cos i
(5.9)
The Lambertian assumption is in general not correct, the actual reflectance being described by a complicated bidirectional reflectance distribution function (BRDF). An empirical appraoch which gives a better approximation to the BRDF is the C-correction [TGG82].
Let m and b be the slope and intercept of a regression line for reflectance vs. cos i for a
particular image band. Then instead of (5.9) one uses

cosi + b/m
H = T
.
(5.10)
cos i + b/m
An ENVI plug-in for illumination correction with the C-correction approximation is given in Appendix D.2.2.
Chapter 6
Image Registration
Image registration, either to another image or to a map, is a fundamental task in image
processing. It is required for georeferencing, stereo imaging, accurate change detection, or
any kind of multitemporal image analysis.
Image-to-image registration methods can be divided into roughly four classes [RC96]:
1. algorithms that use pixel values directly, i.e. correlation methods
2. frequency- or wavelet-domain methods that use e.g. the fast fourier transform(FFT)
3. feature-based methods that use low-level features such as edges and corners
4. algorithms that use high level features and the relations between them, e.g. objectoriented methods
We consider examples of frequency-domain and feature-based methods here.
6.1
Frequency domain registration
Consider two N N gray scale images g1 (i0 , j 0 ) and g2 (i, j), where g2 is offset relative to g1
by an integer number of pixels:
g2 (i, j) = g1 (i0 , j 0 ) = g1 (i i0 , j j0 ),
i0 , j0 N.
Taking the Fourier transform we have

X
g2 (k, l) =
g1 (i i0 , j j0 )ei2(ik+jl)/N ,
ij
or with a change of indices to i0 j 0 ,

X
0
0
g2 (k, l) =
g1 (i0 , j 0 )ei2(i k+j l)/N ei2(i0 k+j0 l)/N = g1 (k, l)ei2(i0 k+j0 l)/N .
i0 j 0
(This is referred to as the Fourier translation property.) Therefore we can write

g2 (k, l)
g1 (k, l)
= ei2(i0 k+j0 l)/N ,
|
g2 (k, l)
g1 (k, l)|
53
(6.1)
54
CHAPTER 6. IMAGE REGISTRATION
Figure 6.1: Phase correlation of two identical images shifted by 10 pixels.

where g1 is the complex conjugate of g1 . The inverse transform of the right hand side
exhibits a Dirac delta function (spike) at the coordinates (i0 , j0 ). Thus if two otherwise
identical images are offset by an integer number of pixels, the offset can be found by taking
their Fourier transforms, computing the ratio on the left hand side of (6.1) (the so-called
cross-power spectrum) and then taking the inverse transform of the result. The position of
the maximum value in the inverse transform gives the values of i0 and j0 . The following
IDL program illustrates the procedure, see Fig. 6.1
; Image matching by phase correlation
; read a bitmap image and cut out two 512x512 pixel arrays
filename = Dialog_Pickfile(Filter=*.jpg,/Read)
if filename eq then print, cancelled else begin
Read_JPeG,filename,image
g1 = image[0,10:521,10:521]
g2 = image[0,0:511,0:511]
; perform Fourier transforms
f1 = fft(g1, /double)
f2 = fft(g2, /double)
; Determine the offset
g = fft( f2*conj(f1)/abs(f1*conj(f1)), /inverse, /double )
6.2. FEATURE MATCHING
55
pos = where(g eq max(g))

print, Offset = + strtrim(pos mod 512) + strtrim(pos/512)
thisDevice =!D.Name
set_plot, PS
Device, Filename=c:\temp\phasecorr.eps,xsize=4,ysize=4,/inches,/Encapsulated
shade_surf,g[0,0:50,0:50]
device,/close_file
endelse
end
Images which differ not only by an offset but also by a rigid rotation and change of scale
can in principle be registered similarly, see [RC96].
6.2
Feature matching
A tedious task associated with image-image registration using low level image features is
the setting of ground control points (GCPs) since, in general, it is necessary to resort to
the manual entry. However various techniques for automatic determination of GCPs have
been suggested in the literature. We will discuss one such method, namely contour matching
[LMM95]. This technique has been found to function reliably in bitemporal scenes in which
vegetation changes do not dominate. It can of course be augmented (or replaced) by other
automatic methods or by manual determination. The procedures involved in image-image
registration using contour matching are shown in Fig. 6.2 [LMM95].
Image 1
LoG
Zero Crossing
Image 2
Edge Strength
Contour
Finder
Chain Code
Encoder
?
Image 2
(registered)
Warping
Consistency
Check
Closed Contour
Matching
Figure 6.2: Image-image registration with contour matching.
56
6.2.1
Contour detection
The first step involves the application of a Laplacian of Gaussian filter to both images. After
determining the contours by examining zero-crossings of the LoG-filtered image, the contour
strengths are encoded in the pixel intensities. Strengths are taken to be proportional to the
magnitude of the gradient at the zero-crossing.
6.2.2
Closed contours
In the next step, all closed contours with strengths above some given threshold are determined by tracing the contours. Pixels which have been visited during tracing are set to zero
so that they will not be visited again.
6.2.3
Chain codes
For subsequent matching purposes, all significant closed contours found in the preceding
step are chain encoded. Any digital curve can be represented by an integer sequence
{a1 , a2 . . . ai . . .}, ai {0, 1, 2, 3, 4, 5, 6, 7}, depending on the relative position of the current
pixel with respect to the previous pixel in the curve. This simple code has the drawback
that some contours produce wrap around. For example the line in the direction 22.5o has
the chain code {707070 . . .}. Li et al. [LMM95] suggest the smoothing operation:
{a1 a2 . . . an } {b1 b2 . . . bn },
where b1 = a1 and bi = qi , qi is an integer satisfying (qi ai ) mod 8 = 0 and |qi bi1 | min,
i = 2, 3 . . . n.
They also suggest the applying the Gaussian smoothing filter {0.1, 0.2, 0.4, 0.2, 0.1} to the
result. Two chain codes can be compared by sliding one over the other and determining
the maximum correlation between them.
6.2.4
Invariant moments
The closed contours are first matched according to their invariant moments. These are
defined as follows, see [Hab95, GW02]. Let the set C denote the set of pixels defining a
contour, with |C| = n, that is, n is the number of pixels on the contour. The moment of
order p, q of the contour is defined as
X
mpq =
j p iq .
(6.2)
i,jC
Note that n = m00 . The center of gravity xc , yc of the contour is thus

xc =
m10
,
m00
yc =
m01
.
m00
The centralized moments are then given by

pq =
X
i,jC
(j xc )p (i yc )q ,
(6.3)
6.2. FEATURE MATCHING
57
and the normalized centralized moments by

pq =
For example,
20 =
1
(p+q)/2+1
00
pq .
(6.4)
1
1 X
=
(j yc )2 .
20
200
n2
i,jC
The normalized centralized moments are, apart from effects of digital quantization, invariant
under scale changes and translations of the contours.
Finally, we can define moments which are also invariant under rotations, see [Hu62]. The
first two such invariant moments are
h1 = 20 + 02
2
h2 = (20 02 )2 + 411
.
(6.5)
For example, consider a general rotation of the coordinate axes with origin at the center of
gravity of a contour:
0

j
cos
sin
j
j
=
=
A
.
i0
sin cos
i
i
The first invariant moment in the rotated coordinate system is
0
1 X 02
1 X 0 0
j
02
(j + i ) = 2
(j , i ) 0
h1 = 2
i
n 0 0
n 0 0
i ,j C
i ,j C

1 X
j
(j, i)A> A
= 2
i
n
i,jC
1 X 2
= 2
(j + i2 ),
n
i,jC
since A> A = I.
6.2.5
Contour matching
Each significant contour in one image is first matched with contours in the second image
according to their invariant moments h1 , h2 . This is done by setting a threshold on the
allowed differences, for instance 1 standard deviation. If one or more matches is found, the
best candidate for a GCP pair is then chosen to be that matched contour in the second
image for which the chain code correlation with the contour in the first image is maximum.
If the maximum correlation is less that some threshold, e.g. 0.9, then no match is found.
The actual GCP coordinates are taken to be the centers of gravity of the matched contours.
6.2.6
Consistency check
The contour matching procedure invariably generates false GCP pairs, so a further processing step is required. In [LMM95] use is made of the fact that distances are preserved under
a rigid transformation. Let A1 A2 represent the distance between two points A1 and A2 in
58
an image. For two sets of m matched contour centers {Ai } and {Bi } in image 1 and 2, the
ratios
Ai Aj /Bi Bj , i = 1 . . . m, j = i + 1 . . . m,
are calculated. These should form a cluster, so that pairs scattered away from the cluster
center can be rejected as false matches.
An ENVI plug-in for GCP determination via contour matching is given in
Appendix D.3.
6.3
Re-sampling and warping
We represent with (x, y) the coordinates of a point in image 1 and the corresponding point
in image 2 with (u, v). A second order polynomial map of image 2 to image 1, for example,
is given by
u = a0 + a1 x + a2 y + a3 xy + a4 x2 + a5 y 2
v = b0 + b1 x + b2 y + b3 xy + b4 x2 + b5 y 2 .
Since there are 12 unknown coefficients, we require at least 6 GCP pairs to determine the
map (each pair generates 2 equations). If more than 6 pairs are available, the coefficients can
be found by least squares fitting. This has the advantage that an RMS error for the mapping
can be estimated. Similar considerations apply for lower or higher order polynomial maps.
Having determined the map coefficients, image 2 can be registered to image 1 by resampling. Nearest neighbor resampling simply chooses the actual pixel in image 2 that has
its center nearest the calculated coordinates (u, v) and transfers it to location (x, y). This
is the preferred technique for classification or change detection, since the registered image
consists of the original pixel brightnesses, simply rearranged in position to give a correct
image geometry. Other commonly used resampling methods are bilinear interpolation and
cubic convolution interpolation, see [JRR99] for details. These methods mix the spectral
intensities of neighboring pixels.
6.3. RE-SAMPLING AND WARPING
59
Exercises
1. We can approximate the centralized moments (6.3) of a contour by the integral
Z Z
pq =
(x xx )p (y yc )q f (x, y)dxdy,
where the integration is over the whole image and where f (x, y) = 1 if the point
(x, y) lies on the contour and f (x, y) = 0 otherwise. Use this approximation to prove
that the normalized centralized moments pq given in (3.4) are invariant under scaling
transformations of the form
0

x
0
x
=
.
y0
0
y
60
Chapter 7
Image Sharpening
The change detection and classification algorithms that we will meet in the next chapters
exploit of course not only the spatial but also the spectral information of satellite imagery.
Many common platforms (Landsat 7 TM, IKONOS, SPOT, QuickBird) offer panchromatic
images with higher ground resolution than that of the spectral channels. Application of multispectral change detection or classification methods is therefore restricted to the lower resolution. Conventional image fusion techniques, such as the well-known HSV-transformation
can be used to sharpen the spectral components, however the effect of mixing-in of the
panchromatic image is often to dilute the spectral resolution. Another disadvantage of
the HSV transformation is that one is restricted to using three of the available spectral
channels. In the following we will outline the HSV method and then consider alternative
fusion techniques.
7.1
HSV fusion
In computers with 24-bit graphics (true color), any three channels of a multispectral image
can be displayed with 8 bits for each of the additive primary colors red, green and blue. The
monitor displays this as an RGB color composite image which, depending on the choice of
image channels and their relative intensities, may or may not appear to be natural. There
are 224 16 million colors possible.
Another means of color definition is in terms of hue, saturation and value (HSV). Value
(or intensity) can be thought of as an axis equidistant from the three orthogonal primary
color axes. Hue refers to the actual color and is defined as an angle on a circle perpendicular
to the value axis. Saturation is the amount of color present and is represented by the
radius of the circle described by the hue,
A commonly used method for fusion of two images (for example a lower resolution multispectral image with a higher resolution panchromatic image) is to transform the first image
from RGB to HSV space, replace the V component with the grayscale values of the second
image after performing a radiometric normalization, and then transform back to RGB space.
The forward transformation begins by rotating the RGB coordinate axes into the diagonal
61
62
CHAPTER 7. IMAGE SHARPENING
axis of the RGB color cube. The coordinates in the new reference system are given by

m1
2/ 6
m2 = 0
i1
1/ 3

1/ 6 1/6
R
1/2 1/ 2 G .
1/ 3
1/ 3
B
Then the the rectangular coordinates (m1 , m2 , i1 ) are transformed into the cylindrical HSV
coordinates:
q
H = arctan(m1 /m2 ), S = m21 + m22 , I = 3 i1 .

The following IDL code illustrates the necessary steps for HSV fusion making use of ENVI
batch procedures. These are also invoked directly from the ENVI main menu.
pro HSVFusion, event
; get MS image
envi_select, title=Select low resolution three-band input file, $
fid=fid1, dims=dims1, pos=pos1
if (fid1 eq -1) or (n_elements(pos1) ne 3) then return
; get PAN image
envi_select, title=Select panchromatic image, $
fid=fid2, pos=pos2, dims=dims2, /band_only
if (fid2 eq -1) then return
envi_check_save, /transform
; linear stretch the images and convert to byte format
envi_doit,stretch_doit, fid=fid1, dims=dims1, pos=pos1, method=1, $
r_fid=r_fid1, out_min=0, out_max=255, $
range_by=0, i_min=0, i_max=100, out_dt=1, out_name=c:\temp\hsv_temp
envi_doit,stretch_doit, fid=fid2, dims=dims2, pos=pos2, method=1, $
r_fid=r_fid2, out_min=0, out_max=255, $
range_by=0, i_min=0, i_max=100, out_dt=1, /in_memory
envi_file_query, r_fid2, ns=f_ns, nl=f_nl
f_dims = [-1l, 0, f_ns-1, 0, f_nl-1]
; HSV sharpening
envi_doit, sharpen_doit, $
fid=[r_fid1,r_fid1,r_fid1], pos=[0,1,2], f_fid=r_fid2, $
f_dims=f_dims, f_pos=[0], method=0, interp=0, /in_memory
; remove temporary files from ENVI
envi_file_mng, id=r_fid1, /remove, /delete
envi_file_mng, id=r_fid2, /remove
end
7.2. BROVEY FUSION
7.2
63
Brovey fusion
In its simplest form this method multiplies each re-sampled multispectral pixel by the ratio
of the corresponding panchromatic pixel intensity to the sum of all of the multispectral
intensities. The corrected pixel intensities gk (i, j) in the kth fused multispectral channel are
given by
gp (i, j)
,
0
k0 gk (i, j)
gk (i, j) = gk (i, j) P
(7.1)
where gk (i, j) is the (re-sampled) pixel intensity in the kth channel and gp (i, j) is the corresponding pixel intensity in the panchromatic image. (The ENVI-environment offers Brovey
fusion in its main menu.) This technique assumes that the spectral range spanned by the
panchromatic image is essentially the same as that covered by the multispectral channels.
This is seldom the case. Moreover, to avoid bias, the intensities used should be the radiances
at the satellite sensors, implying use of the sensors calibration.
7.3
PCA fusion
Panchromatic sharpening using principal components analysis (PCA) is similar to the HSV
method. After the PCA transformation, the first principal component is replaced by the
panchromatic image, again after radiometric normalization, see Figure 7.1.
Figure 7.1: Panchromatic fusion with the principal components transformation.

Image sharpening using PCA and the closely related Gram-Schmidt transformation is
available from the ENVI main menu.
64
7.4
Wavelet fusion
Wavelets provide an efficient means of representing high and low frequency components of
multispectral images and can be used to perform image sharpening. Two examples are given
here.
7.4.1
Discrete wavelet transform
The discrete wavelet transform (DWT) of a two-dimensional image is shown in Appendix

B to be equivalent to an iterative application of the high-low-pass filter bank illustrated in
Figure 7.2
Rows
Columns

- H - - H
-
-g
(i, j)

k+1
gk (i, j)
- G

-
- C H (i, j)
k+1

-
- C V (i, j)
k+1

-
- C D (i, j)
k+1
- G

- - H

- G
Figure 7.2: Wavelet filter bank. H is a low-pass and G a high-pass filter derived from the
coefficients of the wavelet transformation. The symbol indicates downsampling by a factor
of 2. The original image gk (i, j) can be reconstructed by inverting the filter.
A single application of the filter corresponding to the Daubechies D4 wavelet to a satellite

image g1 (i, j) (1m resolution) is shown in Figure B.12. The high frequency information
(wavelet coefficients) is stored in the arrays C2H , C2V and C2D and displayed in the upper right,
lower left and lower right quadrants, respectively. The original image with its resolution
degraded by a factor two, g2 (i, j), is in the upper left quadrant. Applying the filter bank
iteratively to the upper left quadrant yields a further reduction by a factor of 2.
The fusion procedure for IKONOS or QuickBird imagery for instance, in which the
resolutions of panchromatic and the 4 multispectral components differ exactly by a factor
of 4, is then as follows: Both the degraded panchromatic image and the four multispectral
images are compressed once again (e.g. to 8m resolution in the case of IKONOS) and the high
frequency components C4z , z = H, V, D, are sampled to estimate the correction coefficients
z
z
az = ms
/pan
bz = mzms az mzpan ,
(7.2)
where mz and z denote mean and standard deviation, respectively. These coefficients are
then used to normalize the wavelet coefficients for the panchromatic image to those of the
multispectral image:
Ciz (i, j) az Ciz (i, j) + bz ,
z = H, V, D, i = 2, 3.
(7.3)
7.4. WAVELET FUSION
65
The degraded panchromatic image g3 (i, j) is then replaced by the each of the four multispectral images and the normalized wavelet coefficients are used to reconstruct the original 1m
resolution. We thus obtain what would be seen if the multispectral sensors had the resolution
of the panchromatic sensor [RW00].
An ENVI plug-in for panchromatic sharpening with the DWT is given in
Appendix D.4.1.
7.4.2
` trous filtering
A
The radiometric fidelity obtained with the discrete wavelet transform is excellent, as will be
shown in the next section. However the lack of translational invariance of the DWT often
leads to spatial artifacts (blurring, shadowing, staircase effect) in the sharpened product.
This is illustrated in the following program, in which an image is transformed once with the
DWT and the low-pass quadrant shifted by one pixel relative to the high-pass quadrants
(i.e. the wavelet coefficients). After inverting the transformation, serious degradation is
apparent, see Figure 7.3.
pro translate_wavelet
; get an image band
envi_select, title=Select input file, $
fid=fid, dims=dims, pos=pos, /band_only
if fid eq -1 then return
; create a DWT object
aDWT = Obj_New(DWT,envi_get_data(fid=fid,dims=dims,pos=pos))
; compress
aDWT->compress
; shift the compressed portion supressing phase correlation match
aDWT->inject,shift(aDWT->Get_Quadrant(0),[1,1]),pc=0
; restore
aDWT->expand
; return result to ENVI
envi_enter_data, aDWT->get_image()
end
As an alternative to the DWT, the `
a trous wavelet transform (ATWT) has been proposed
for image sharpening [AABG02]. The ATWT is a multiresolution decomposition defined
formally by a low-pass filter H = {h(0), h(1), . . .} and a high-pass filter G = H, where
denotes an all-pass filter. Thus the high frequency part is just the difference between the
original image and low-pass filtered image. Not surprisingly, this transformation does not
allow perfect reconstruction if the output is downsampled. Therefore downsampling is not
performed at all. Rather, at the kth iteration of the low-pass filter, 2k1 zeroes are inserted
between the elements of H. This means that every other pixel is interpolated on the first
iteration:
H = {h(0), 0, h(1), 0, . . .},
while on the second iteration
H = {h(0), 0, 0, h(1), 0, 0, . . .}
etc. (hence the name `
a trous = with holes). The low-pass filter is usually chosen to be
symmetric (unlike the Daubechies wavelet filters for example). The prototype filter chosen
66
here is the cubic B-spline filter

H = {1/16, 1/4, 3/8, 1/4, 1/16}.
The transformation is highly redundant and requires considerably more computer storage
to implement. However when used for image sharpening it is much less sensitive to misalignment between the multispectral and panchromatic images.
Figure 7.3: Artifacts due to lack of translational invariance of the DWT.

Figure 7.4 outlines the scheme implemented in the ENVI plug-in for ATWT panchromatic
sharpening. The MS band is nearest-neighbor upsampled by a factor of 2 to match the
dimensions of the high resolution band. The `
a trous transformation is applied to both bands
(columns and rows are filtered with the upsampled cubic spline filter, with the difference
determining the high-pass result). The high frequency component of the pan image is
normalized to that of the MS image in the same way as for DWT sharpening, equations
(7.2) and (7.3). Then the low frequency pan component is replaced by the filtered MS
image and the transformation inverted. An ENVI plug-in for ATWT sharpening is
described in Appendix D.4.2.
7.5
Quality indices
Wang and Bovik [WB02] suggest the following measure of radiometric fidelity between two
image bands f and g:
7.5. QUALITY INDICES
67
6
6
Pan
G
MS

MS(sharpened)
6
normalize
?
insert
` trous image sharpening scheme for an MS to panchromatic resolution ratio of

Figure 7.4: A
two. The symbol H denotes the upsampled low-pass filter.
Figure 7.5: Comparison of three image sharpening methods with the Wang-Bovik quality
index. Left to right: Gram-Schmidt, ATWT, DTW.
68
Q=
f g
2fg
2f g
4f g fg
2
=
f g f + g2 f2 + g2
(f2 + g2 )(f2 + g2 )
(7.4)
where f and f are mean and variance of band f and f g is the covariance of the two
bands. This first term in (7.4) is seen to be the correlation coefficient between the two
images, with values in [1, 1], the second term compares their average brightness, with
values in [0, 1] and the third term compares their contrasts, also in [0, 1]. Thus perfect
radiometric correspondence would give a value Q = 1.
Since image quality is usually not spatially invariant, it is usual to compute Q in, say,
M sliding windows and then average over all such windows:
Q=
M
1 X
Qj .
M j=1
An ENVI plug-in for determining the quality index for pansharpened images is
given in Appendix D.4.3.
Figure 7.5 shows a comparison of three image sharpening methods applied to a QuickBird
image, namely the Gram-Schmidt, ATWT and DWT transformations. The latter is by far
the best, but spatial artifacts are apparent.
Chapter 8
Change Detection
To quote Singhs review article on change detection [Sin89],
The basic premise in using remote sensing data for change detection is that
changes in land cover must result in changes in radiance values ... [which] must
be large with respect to radiance changes from other factors.
In the present chapter we will mention briefly the most commonly used digital techniques for
enhancing this change signal in bitemporal satellite images, and then focus our attention
on the so-called multivariate alteration detection algorithm of Nielsen et al. [NCS98].
8.1
Algebraic methods
In order to see changes in the two multispectral images represented by N -dimensional random vectors F and G, a simple procedure is to subtract them from each other componentby-component, examining the N differenced images characterized by
F G = (F1 G1 , F2 G2 . . . FN GN )>
(8.1)
for significant changes. Pixel intensity differences near zero indicate no change, large positive
or negative values indicate change, and decision thresholds can be set to define significant
changes. If the difference signatures in the spectral channels are used to classify the kind of
change that has taken place, one speaks of change vector analysis. Thresholds are usually
expressed in standard deviations from the mean difference value, which is taken to correspond
to no change.
Alternatively, ratios of intensities of the form
Fk
,
Gk
k = 1...N
(8.2)
can be built between successive images. Ratios near unity correspond to no-change, while
small and large values indicate change. A disadvantage of this method is that random
variables of the form (8.2) are not normally distributed, so simple threshold values defined
in terms of standard deviations are not valid.
Other algebraic combinations, such as differences in vegetation indices (Section 2.1) are
also in use. All of these band math operations can of course be performed conveniently
within the ENVI/IDL environment.
69
70
8.2
CHAPTER 8. CHANGE DETECTION
Principal components
Figure 8.1: Change detection with principal components.

Consider the bitemporal feature space for a single spectral band m in which each pixel
is denoted by a point (fm , gm ), a realization of the random vector (Fm , Gm ). Since the
unchanged pixels are highly correlated, they will lie in a narrow, elongated cluster along the
principal axis, whereas change pixels will lie some distance away from it, see Fig. 8.1. The
second principal component will thus quantify the degree of change associated with a given
pixel. Since the principal axes are determined by diagonalization of the covariance matrix for
all of the pixels, the no-change axis may be poorly determined. To avoid this problem, the
principal components can be determined iteratively using weights for each pixel according
to the magnitude of the second principal component. This method can be generalized to
treat all multispectral bands simultaneously [Wie97].
8.3
Post-classification comparison
If two co-registered satellite images have been classified, then the class labels can be compared to determine land cover changes. If classification is carried out at the pixel level (as
opposed to segments or objects), then classification errors (typically > 5%) may dominate
the true changes, depending on the magnitude of the latter. ENVI offers functions for
statistical analysis of post-classification change detection.
8.4. MULTIVARIATE ALTERATION DETECTION
8.4
71
Multivariate alteration detection
Suppose we make a linear combination of the intensities for all N channels in the first image
acquired at time t2 , represented by the random vector F. That is, we create a single image
whose pixel intensities are
U = a > F = a 1 F 1 + a 2 F2 + . . . aN F N ,
where the vector of coefficients a is as yet unspecified. We do the same for t2 , i.e. we make
the linear combination V = b> G, and then look at the scalar difference image U V . This
procedure combines all the information into a single image, whereby one still hast to choose
the coefficients a and b in some suitable way. Nielsen et al. [NCS98] suggest determining
the coefficients so that the positive correlation between U and V is minimized. This means
that the resulting difference image U V will show maximum spread in its pixel intensities.
If we assume that the spread is primarily due to actual changes that have taken place in the
scene over the interval t2 t1 , then this procedure will enhance those changes as much as
possible.
Specifically we seek linear combinations such that
var(U V ) = var(U ) + var(V ) 2cov(U, V ) maximum,
(8.3)
subject to the constraints

var(U ) = var(V ) = 1.
(8.4)
var(U V ) = 2(1 ),
(8.5)
Note that under these constraints
where is the correlation of the transformed vectors U and V ,

= corr(U, V ) = p
cov(U, V )
var(U )var(V )
Since we are dealing with change detection, we require that the random variables U and V
be positively correlated, that is,
cov(U, V ) > 0.
We thus seek vectors a and b which minimize the positive correlation .
8.4.1
Canonical correlation analysis
Canonical correlation analysis leads to a transformation of each set of variables F and G

such that their mutual correlation is displayed unambiguously, see [And84], Chapter 12. We
can derive the transformation as follows:
For multivariate normally distributed data the combined random vector is distributed
as

0
f f f g
F
N
,
.
0
gf gg
G
Recalling the property (1.6) we have
var(U ) = a> f f a,
var(V ) = b> gg b,
cov(U, V ) = a> f g b.
72
If we introduce the Lagrange multipliers /2 and /2, extremalizing the covariance cov(U, V )
under the constraints (8.4) is equivalent to extremalizing the unconstrained Lagrange function
L = a> f g b (a> f f a 1) (b> gg b 1).

2
2
Differentiating, we obtain
L
= f g b 2f f a = 0,
a
2
or
a=
1 1
f g b,
ff
= f g a 2gg b = 0
b
2
b=
1 1
f g a.
gg
The correlation between the random variables U and V is

cov(U, V )
= p
var(U )var(V )
=p
a > f g b
.
a> f f a b> gg b
Substituting for a and b in this equation gives (with >

f g = gf )
2 =
a> f g 1
gg gf a
,
>
a f f a
2 =
b> gf 1
f f f g b
b> gg b
which are equivalent to the two generalized eigenvalue problems

2
f g 1
gg gf a = f f a
2
gf 1
f f f g b = gg b.
(8.6)
Thus the desired projections U = a> F are given by the eigenvectors a1 . . . aN corresponding
to the generalized eigenvalues
2 1 . . . N
>
of f g 1
gg gf with respect to f f . Similarly the desired projections V = b G are given
1
by the eigenvectors b1 . . . bN of gf f f f g with respect to gg corresponding to the same
eigenvalues. Nielsen et al. [NCS98] refer to the N difference components
Mi = Ui Vi = ai > F bi > G, i = 1 . . . N,
(8.7)
as the multivariate alteration detection (MAD) components of the combined bitemporal

image.
8.4.2
Solution by Cholesky factorization
Equations (8.6) are of the form

1 a = a,
where both 1 and are symmetric and is positive definite. The Cholesky factorization
of is
= LL> ,
where L is a lower triangular matrix, and can be thought of as the square root of . Such
an L always exists is is positive definite. Therefore we can write
1 a = LL> a

or, equivalently,
73
L1 1 (L> )1 L> a = L> a
or, with d = L> a and commutivity of inverse and transpose,

[L1 1 (L1 )> ]d = d,
a standard eigenproblem for a real, symmetric matrix L1 1 (L1 )> . Let the (orthogonal)
eigenvectors be di . We have
>
>
>
0 = d>
i dj = ai LL aj = ai aj ,
8.4.3
i 6= j.
(8.8)
Properties of the MAD components
We have, from (8.4) and (8.8), for the eigenvectors ai and bi ,

>
a>
i f f aj = bi gg bj = ij .
Furthermore
1
bi = 1
gg gf ai ,
i
i.e. substituting this into the LHS of the second equation in (8.6):
p
1
1
1 1
gf 1
i ai = i gg bi ,
f f f g gg gf ai = gf f f i f f ai = gf
i
i
as required. It follows that
p
p
> 1
f g 1
j a >
j ij ,
a>
gg gf aj =
i f g bj = ai p
i f f ai =
j
and similarly for b>
i gf aj . Thus the covariances of the MAD components are given by
p
>
>
>
cov(Ui Vi , Uj Vj ) = cov(a>
j ).
i F bi G, aj F bj G) = 2ij (1
The MAD components are therefore orthogonal (uncorrelated) with variances
p
2
i ).
var(Ui Vi ) = M
ADi = 2(1
(8.9)
The transformation corresponding to the smallest eigenvalue, namely (aN , bN ), will thus
give maximal variance for the difference U V .
We can derive change probabilities from a MAD image as follows. The sum of the squares
of the standardized MAD components for no-change pixels, given by

2

2
M ADN
M AD1
Z=
+ ... +
,
M AD1
M ADN
is approximately chi-square distributed with N degrees of freedom, i.e.,
P r(Z z) = P (N/2, z/2).
For a given measured value z for some pixel, the probability that Z could be that large or
larger, given that the pixel is no-change, is
1 P (N/2, z/2).
74
The probability that the pixel is a change pixel is therefore the complement of this,
Pchange (z) = 1 (1 P (N/2, z/2)) = P (N/2, z/2).
(8.10)
This quantity can be plotted for example as a gray scale image to show the regions of change.
The last MAD component has maximum spread in its pixel intensities and, ideally,
maximum change information. However, depending on the type of change one is looking for,
the other components may also be extremely useful. The second-to-last image has maximum
spread subject to the condition that the pixel intensities are statistically uncorrelated with
those in the first image, and so on. Since interesting anthropomorphic changes will generally
be uncorrelated with dominating seasonal vegetation changes or stochastic image noise, it is
quite common that such changes will be concentrated in higher order components. This in
fact is one of the nicest aspects of the method it sorts different categories of change into
different image components. Therefore we can also perform change vector analysis on the
MAD change vector.
An ENVI plug-in for MAD is given in Appendix D.5.1.
8.4.4
Covariance of MAD variates with original observations
With (8.7) and A = (ai . . . aN ), B = (bi . . . bN )

hFM> i = hF(A> F B> G)> i = f f A f g B
hGM> i = hG(A> F B> G)> i = gf A gg B.
8.4.5
Scale invariance
An additional advantage of the MAD procedure stems from the fact that the calculations
involved are invariant under linear transformations of the original image intensities. This
implies that the method is insensitive to differences in atmospheric conditions or sensor
calibrations at the two acquisition times. We can see this as follows. Suppose the second
image G is transformed according to some linear transformation T,
H = TG.
The relevant covariance matrices for (8.6) are then
0f g = hFH> i = f g T>
0gf = hHF> i = Tgf
0f f = f f
0gg = hHH> i = Tgg T> .
The eigenproblems are therefore
f g T> (Tgg T> )1 Tgf a = 2 f f a
>
2
>
Tgf 1
f f f g T c = Tgg T c,
where c is the desired projection for H. These are equivalent to

2
f g 1
gg gf a = f f a
>
2
>
gf 1
f f f g (T c) = gg (T c),
75
which are identical to (8.6) with b = T> c. Therefore the MAD components in the transformed situation are
>
>
>
>
>
>
>
>
a>
i F ci H = ai F ci TG = ai F (T ci ) G = ai F bi G
as before.
8.4.6
Improving signal to noise
The MAD transformation can be augmented by subsequent application of the maximum

autocorrelation factor (MAF) transformation, in order to improve the spatial coherence of
the difference components, see [NCS98]. When image noise is estimated as the difference
between intensities of neighboring pixels, the MAF transformation is equivalent to the MNF
transformation. The MAF/MAD variates thus generated are also orthogonal and invariant
under affine transformations. An ENVI plug-in for performing the MAF transformation is given in Appendix D.5.2.
8.4.7
Decision thresholds
Since the MAD components are approximately normally distributed about zero and uncorrelated, see Figure 8.2, decision thresholds for change or no change pixels can be set in terms
of standard deviations about the mean for each component separately. This can be done
arbitrarily, for example by saying that all pixels in a MAD component whose intensities are
within 2M AD are no-change pixels.
Figure 8.2: Scatter plot of two MAD components.

We can do better than this, however, using a Bayesean technique. Let us consider the
following mixture model for a random variable X representing one of the MAD components:
p(x) = p(x | N C)p(N C) + p(x | C)p(C) + p(x | C+)p(C+),
(8.11)
76
Figure 8.3: Probability mixture model for MAD components.

where C+, C and N C denote positive change, negative change and no change, respectively,
see Fig. 8.3. The set of measurements S = {xi } may be grouped into four disjoint sets:
SN C ,
SC ,
SU = S\SN C SC SC+ ,
SC+ ,
with SU denoting the set of ambiguous pixels.1 From the sample mean and sample variance,
we estimate initially the moments for the distribution of no-change pixels:
N C =
(N C )2 =
1
|SN C |
1
|SN C |
xi ,
iSN C
(xi N C )2
iSN C
(|S| denotes set cardinality) and similarly for C and C+. Bruzzone and Prieto [BP00]
suggest improving these estimates by using the pixels in SU and applying the so-called EM
algorithm (see [Bis95] for a good explanation):
0N C =
p(N C | xi )xi /
iS
0
2
(N
C) =
p(N C | xi )
iS
p(N C | xi )(xi 0N C )2 /
iS
p0 (N C) =
1 X
p(N C | xi ) ,
|S|
X
iS
p(N C | xi )
(8.12)
iS
where p(N C | xi ) is the a posteriori probability for a no-change pixel conditional on measurement xi . We have the following rules for determining p(N C | xi ):
1. i SN C :
p(N C | xi ) = 1
2. i SC :
p(N C | xi ) = 0
1 The symbols and \ denote set union and set difference, respectively. These sets can be determined
in practice by setting generous, scene-independent thresholds for change and no-change pixel intensities, see
[BP00].
8.5. RADIOMETRIC NORMALIZATION

3. i SU :
77
p(N C | xi ) = p(xi | N C)p(N C)/p(xi ) (Bayes rule),

1
(xi N C )2
where p(xi | N C) =
exp
2
2N
2N C
C
and p(xi ) = p(xi | N C)p(N C) + p(xi | C)p(C)
+ p(xi | C+)p(C+).
Substituting into (8.12) we obtain the set of equations

P
P
iSU p(N C | xi )xi +
iSN C xi
0
P
N C =
p(N
C
|
x
)
+
|S
i
NC |
iSU
P
P
0
2
0
2
iSU p(N C | xi )(xi N C ) +
iSN C (xi N C )
0
2
P
(N C ) =
iSU p(N C | xi ) + |SN C |
!
X
1
0
p(N C | xi ) + |SN C | ,
p (N C) =
|S|
iSU
which can be iterated numerically to improve the initial estimates of the distributions. One
can then determine e.g. the upper change threshold as the appropriate solution of
p(x | N C)p(N C) = p(x | C+)p(C+).
Taking logarithms,

1
N C p(C+)
1
2
2
(x
(x
)
=
log
=: A
C+
NC
2
2
2C+
2N
C+ P (N C)
C
with solutions
x=
2
2
C+ N
C N C C+ N C C+
2
2
(N C C+ )2 + 2A(N
C C+ )
2
2
N
C C+
A corresponding expression obtains for the lower threshold.

In the next chapter we will extend this method to discriminate clusters of change and
no change pixels. An ENVI GUI for determining change thresholds is given in
Appendix D.6.6.
8.5
Radiometric normalization
Radiometric normalization of satellite imagery requires, among other things, an atmospheric

correction algorithm and the associated atmospheric properties at the times of image acquisition. For most historical satellite scenes such data are not available and even for planned
acquisitions they may be difficult to obtain. A relative normalization based on the radiometric information intrinsic to the images themselves is an alternative whenever absolute surface
radiances are not required, for example in change detection applications or for supervised
land cover classification.
One usually proceeds under the assumption that the relationship between the at-sensor
radiances recorded at two different times from regions of constant reflectance is spatially
78
homogeneous and can be approximated by linear functions. The critical aspect is the determination of suitable time-invariant features upon which to base the normalization.
As we have seen, the MAD transformation invariant to linear and affine scaling. Thus, if
one uses MAD for change detection applications, preprocessing by linear radiometric normalization is superfluous. However radiometric normalization of imagery is important for many
other applications, such as mosaicing, tracking vegetation indices over time, supervised and
unsupervised land cover classification, etc. Furthermore, if some other, non-invariant change
detection procedure is preferred, it must generally be preceded by radiometric normalization [CNS04]. Taking advantage of this invariance, one can apply the MAD transformation
to select the no-change pixels in bitemporal images, and then used them for radiometric
normalization. The procedure is simple, fast and completely automatic and compares very
favorably with normalization using hand-selected, time-invariant features.
An ENVI plug-in for radiometric normalization with the MAD transformation is given in Appendix D.5.3.
Chapter 9
Unsupervised Classification
Supervised classification of multispectral remote sensing imagery is commonly used for landcover determination, see Chapter 10. For supervised classification it is very important to
define training areas which adequately represent the spectral characteristics of each class in
the image to be classified, as the quality of the training set has a significant effect on the
classification process and its accuracy. Finding and verifying training areas can be rather
laborious since the analyst must select representative pixels for each of the classes. This
must be done by visual examination of the image data and by information extraction from
additional sources such as ground reference data (ground truth) or existing maps.
Unlike supervised classification, clustering methods (or unsupervised methods) require
no training sets at all. Instead, they attempt to find the underlying structure automatically
by organizing the data into classes sharing similar, e.g. spectrally homogeneous, characteristics. The analyst simply needs to specify the number of clusters present. Clustering plays
an especially important role when very little a priori information about the data is available and provides a useful method for organizing a large set of data so that the retrieval
of information may be made more efficiently. A primary objective of using clustering algorithms for pre-classification of multispectral remote sensing data in particular is to obtain
optimum information for the selection of training regions for subsequent supervised land-use
segmentation of the imagery.
9.1
A simple cost function
We begin with the assumption that the measured features (pixel intensities)
x = {xi | i = 1 . . . n}
are chosen independently from K multivariate normally distributed populations corresponding the K principal land cover categories present in the image. The xi are thus realization
of random vectors
Xk N (k , k ), k = 1 . . . K.
(9.1)
Here k and k are the expected value and covariance matrix of Xk , respectively. We
denote a given clustering by C = {C1 , . . . Ck , . . . CK } where Ck denotes the index set for
the kth cluster.1 We wish to maximize the posteriori probability p(C | x) for observing the
1 The
set of indices {i | i = 1 . . . n, xi is in class k}.
79
80
CHAPTER 9. UNSUPERVISED CLASSIFICATION
clustering given the data. From Bayes rule,

p(C | x) =
p(x | C)p(C)
.
p(x)
(9.2)
The quantity p(x|C) is the joint probability density function for clustering C, also referred to
as the likelihood of observing the clustering C given the data x, P (C) is the prior probability
for C and p(x) is a normalization independent of C.
The joint probability density for the data is the product of the individual probability
densities, i.e.,
p(x | C) =
K
Y
Y
p(xi | Ck )
k=1 iCk
K
Y
Y
N/2
(2)
1/2
|k |
k=1 iCk

1
> 1
exp (xi k ) k (xi k ) .
2
Forming the product in this way is justified by the independence of the samples. The
log-likelihood is given by [Fra96]
L = log p(x | C)
K

X
X N
1
1
=
log(2) log |k | (xi k )> 1

(x
)
.
i
k
k
2
2
2
k=1 iCk
With (9.2) we can therefore write

log p(C | x) L + log p(C).
(9.3)
If all K classes exhibit identical covariance matrices according to

k = 2 I,
k = 1 . . . K,
(9.4)
where I is the identity matrix, then L is maximized when the expression

K
X
X
k=1 iCk
K
X
X (xi )> (xi )
1
k
k
(xi ) k ) ( 2 I)(xi k ) =
2
2 2
>
k=1 iCk
is minimized. We are thus led to the cost function

E(C) =
K
X
X (xi )> (xi )
k
k
log p(C).
2 2
(9.5)
k=1 iCk
An optimal clustering C under these assumptions is achieved for

E(C) min .
Now we introduce a hard class dependency in the form of a matrix u with elements
given by
n
1 if i Ck
(9.6)
uki =
0 otherwise.
9.2. ALGORITHMS THAT MINIMIZE THE SIMPLE COST FUNCTION
81
The matrix u satisfies the conditions

K
X
uki = 1,
i = 1 . . . n,
(9.7)
k=1
meaning that each sampled pixel xi , i = 1 . . . n, belongs to precisely one class, and
n
X
uki > 0,
k = 1 . . . K,
(9.8)
i=1
meaning that no class Ck , k = 1 . . . K, is empty. The sum in (9.8) is the number nk of pixels
in the kth class. An unbiased estimate mk of the expected value k for the kth cluster is
therefore given by
Pn
uki xi
1 X
k m k =
xi = Pi=1
, k = 1 . . . K,
(9.9)
n
nk
i=1 uki
iCk
and an estimate Fk of the covariance matrix k by

Pn
uki (xi mk )(xi mk )>
Pn
k Fk = i=1
,
i=1 uki
k = 1 . . . K.
(9.10)
We can now write (9.5) in the form

E(C) =
K X
n
X
k=1 i=1
uki
(xi mk )> (xi mk )

log p(C).
2 2
(9.11)
Finally, if we do not wish to include prior probabilities, we can simply say that all clustering
configurations C are a priori equally likely. Then the last term in (refe911) is independent of
C and we have, dropping the multiplicative constant 1/2 2 , the well-known sum-of-squares
cost function
K X
n
X
uki (xi mk )> (xi mk ).
(9.12)
E(C) =
k=1 i=1
9.2
Algorithms that minimize the simple cost function
We begin with the popular K-means method and then consider an algorithm due to (Palubinskas 1998) [Pal98], which uses cost function (9.11) and for which the number of clusters
is determined automatically. Then we discuss a common version of bottom-up or agglomerative hierarchical clustering, and finally a fuzzy version of the K-means algorithm.
9.2.1
K-means
The K-means clustering algorithm (KM) (sometimes referred to as basic Isodata [DH73] or
migrating means [JRR99]) is based on the cost function (9.12). After initialization of the
cluster centers, the distance measure corresponding to a minimization of (9.12), namely
d(i, k) = (xi mk )> (xi mk )
is used to re-cluster the pixel vectors. Then (9.9) is used to recalculate the cluster centers.
This procedure is iterated until the centers cease to change significantly. K-means clustering
may be performed within the ENVI environment from the main menu.
82
9.2.2
Extended K-means
Denote by pk = p(Ck ) the prior probability for cluster k. The entropy S associated with
this prior distribution is
K
X
S=
pk log pk .
(9.13)
k=1
Distributions with high entropy are those for which the pi are all similar, that is, the pixels
are distributed evenly over all available clusters, see [Bis95]. Low entropy means that most
of the data are concentrated in very few clusters. We choose a prior distribution p(C) in
(9.11) for which few clusters are more probable than many clusters, namely
p(C) exp(E S) = exp E
K
X

pk log pk ,
k=1
where E is a parameter. The cost function (9.11) can then be written as

E(C) =
K X
n
X
X
(xi mk )> (xi mk )
E
pk log pk .
2
2
K
uki
k=1 i=1
(9.14)
k=1
With
nk
1X
=
uki
n
n i=1
n
pk
(9.15)
this becomes
E(C) =
K X
n
X

uki
k=1 i=1

(xi mk )> (xi mk ) E
log pk .
2 2
n
(9.16)
An estimate for the parameter E may be obtained as follows [Pal98]: From (9.14) and
(9.15)

K
X
nk2 pk
E(C)
p
log
p
E k
k .
2 2
k=1
Equating the likelihood and prior terms in this expression and taking k2 2 and pk 1/K,
where K is some a priori expected number of clusters, gives

n
E
.
(9.17)
2 log(1/K)
The parameter 2 can be estimated from the data.
The extended K-means (EKM) algorithm is as follows: First an initial configuration with
a very large number of clusters K is chosen (for one-dimensional data this might conveniently
be the 256 gray values that a pixel with 8-bit resolution can have) and initial values
mk =
n
1 X
uki xi ,
nk i=1
pk =
nk
n
(9.18)
are determined. Then the data are re-clustered according to the distance measure corresponding to a minimization of (9.16):
d(i, k) =
(xi mk )> (xi mk ) E
log pk .
2 2
n
(9.19)
9.2. ALGORITHMS THAT MINIMIZE THE SIMPLE COST FUNCTION
83
The prior term tends to reduce the number of clusters and any class which has in the course
of the algorithm nk = 0 is simply dropped from the calculation. (Condition (9.8) is thus
relaxed.) Iteration of (9.18) and (9.19) continues until no significant changes in the mk
occur.
The explicit choice of the number of clusters K is replaced by the necessity of choosing a
value for the meta-parameter E . This has the advantage that we can use one parameter
for a wide variety of images and let the algorithm itself decide on the actual value of K in
any given instance.
9.2.3
Agglomerative hierarchical clustering
The agglomerative hierarchical clustering algorithm that we consider here is, as for K-means,
based on the cost function (9.12), see [DH73]. It begins by assigning each pixel in the dataset
to its own class or cluster. At this stage of course, the cost function E(C), Eq. (9.12), is
zero. We write E(C) in the form
K
X
Ek
(9.20)
E(C) =
k=1
where Ek is given by
Ek =
(xi mk )> (xi mk ).
iCk
Every agglomeration of clusters to form a smaller number of clusters will increase E(C).
We therefore seek a prescription for choosing two clusters for combination that will increase
E(C) by the smallest amount possible.
Suppose clusters k with nk members and ` with n` members are merged, k < `, and the
new cluster is labeled k. Then
mk
n k mk + n ` m `
=: m.
n k + n`
Thus after the agglomeration, Ek changes to

X
(xi m)
> (xi m)
Ek =
iCk C`
and E` disappears. The net change in E(C) is therefore, after some algebra,
X
X
(k, `) =
(xi m)
> (xi m)

(xi mk )> (xi mk )
iCk C`
iCk
>
(xi m` ) (xi m` )
(9.21)
iC`
nk n`
(mk m` )> (mk m` ).
nk + n`
The minimum increase in E(C) is achieved by combining those two clusters k and ` which
minimize the above expression. Given two alternative candidate cluster pairs with similar combined memberships nk + n` and whose means have similar euclidean separations
kmk m` k, this prescription obviously favors combining that pair with the larger discrepancy between nk and n` . Thus similar-sized clusters are preserved and smaller clusters are
absorbed by larger ones.
84
Let hk, ì represent the cluster formed by combination of the clusters k and `. Then the
increase in cost incurred by combining this cluster with cluster r can be determined from
(9.21) as
(nk + nr )(k, r) + (n` + nr )(`, r) nr (k, `)
.
(9.22)
(hk, ì, r) =
nk + n` + nr
Once
1
(xi xj )> (xi xj )
2
for i, j = 1 . . . n has been initialized from (9.21) for all possible combinations of pixels, the
recursive formula (9.22) can be used to calculate efficiently the cost function for any further
combinations without reference to the original data.
The algorithm terminates when the desired number of clusters has been reached or
continues until a single cluster has been formed. Assuming that the data consist of K
compact and well separated clusters, the slope of E(C) vs. the number of clusters K should
decrease (become more negative) for K K.

(i, j) =
An ENVI plug-in for agglomerative hierarchic clustering is given in Appendix

D.6.1.
9.2.4
Fuzzy K-means
For q > 1 we write (9.9) and (9.12) in the equivalent forms [Dun73]
Pn
uqki xi
mk = Pi=1
k = 1 . . . K,
n
q ,
i=1 uki
E(C) =
K X
n
X
uqki (xi mk )> (xi mk ),
(9.23)
(9.24)
k=1 i=1
and make the transition from hard to fuzzy clustering by replacing (9.6) by continuous
variables
0 < uki < 1, k = 1 . . . K, i = 1 . . . n,
(9.25)
but retaining requirements (9.7) and (9.8). The matrix u is now a fuzzy class membership
matrix.
With i fixed, we seek values for the uki which solve the minimization problem
Ei =
K
X
uqki (xi mk )> (xi mk ) min,
i = 1 . . . n,
k=1
under conditions (9.7). By introducing the Lagrange function

!
K
X
Li = Ei
uki 1
k=1
we can equivalently solve the unconstrained problem Li

respect to uki ,
min. Differentiating with
Li
= q(uki )q1 (xi mk )> (xi mk ) = 0,
uki
k = 1 . . . K,
9.3. EM CLUSTERING
85
s
from which we have
q1
uki =
q1
1
.
(xi mk )> (xi mk )
The Lagrange multiplier is determined by

s
K
K
X
X
q1
1=
uki =
q
k=1
(9.26)
s
q1
k=1
1
,
(xi mk )> (xi mk )
Substituting this into (9.26), we obtain finally

q
q1
uki = P
K
k0 =1
1
(xi mk )> (xi mk )
q1
k = 1 . . . K, i = 1 . . . n.
(9.27)
1
(xi mk
0 )> (x
i mk0 )
The parameter q determines the degree of fuzziness and is usually chosen as q = 2.

The fuzzy K-means (FKM) algorithm consists of a simple iteration of equations (9.23)
and (9.27). The algorithm terminates when the cluster centers mk or alternatively when
the matrix elements uki cease to change significantly. This algorithm should gives similar
results to the K-means algorithm, but one expects it to be less likely to become trapped in
local minima of the cost function.
An ENVI plug-in for fuzzy k-means clustering is given in Appendix D.6.2.
9.3
EM Clustering
The EM (= expectation maximization) algorithm, (see e.g. [Bis95]) replaces uki in (9.27)
by the posterior probability p(Ck | xi ) of class Ck given the observation xi . That is, using
Bayes theorem,
uki p(Ck | xi ) p(xi | Ck )p(Ck ).
Here p(xi | Ck ) is taken to be a multivariate normal distribution function with estimated
mean mk and estimated covariance matrix Fk given by (9.9) and (9.10), respectively. Thus

1
1
uki p(Ck ) p
exp (xi mk )> F1
(x
m
)
.
(9.28)
i
k
k
2
|Fk |
One can use the current class membership to estimate P (Ck ) as pk according to (9.15).
The EM algorithm is then an iteration of equations (9.9), (9.10), (9.15) and (9.28) with
the same termination condition as for the fuzzy K-means algorithm, see also Eqs. (8.12).
After each iteration the columns of u are normalized according to (9.7). Because of the
exponential distance dependence of the membership probabilities in (9.28), the algorithm
is very sensitive to initialization conditions, and can even become unstable. To avoid this
problem, one can first obtain initial values for the mk and for u by preceding the calculation
with the fuzzy K-means algorithm. Explicitly:
Algorithm (EM clustering)
1. Determine starting values for cluster centers mk and initial memberships uki from
the FKM algorithm.
86

2. Determine the cluster centers mk with (9.9) and the prior probabilities P (Ck )
with (9.15).
3. Calculate the weighted covariance matrices Fk with (9.10) and with (9.28) the
class membership probabilities uki . Normalize the columns of u.
4. If u has not changed significantly, stop, else go to 2.
9.3.1
Simulated annealing
Even with initialization using the fuzzy K-means algorithm the EM algorithm may be
trapped in a local optimum. An alternative scheme is to apply so-called simulated annealing.
Essentially the initial memberships are random and only gradually are the calculated class
memberships allowed to influence the estimation of the class centers [Hil01]. The rate of
reduction of randomness is determined by a temperature parameter. For example, the class
memberships in (9.28) may replaced by
uki uki (1 r1/T )
on each iteration, where T is initialized to T0 and reduced at each iteration by a factor c < 1:
T cT
and where r (0, 1) is a uniformly distributed random number. As T approaches zero,
uki will be determined more and more by the probability distribution parameters alone in
(9.28).
9.3.2
Partition density
Since the simple cost function E(C) of (9.12) is no longer relevant, we choose with [GG89]
the partition density as a criterion for choosing the best number of clusters. The fuzzy
hypervolume, defined as
K p
X
F HV =
|Fk |,
k=1
is proportional to the volume in feature space occupied by the ellipsoidal clusters generated
by the algorithm. For instance, for a two dimensional cluster with an elliptical probability
density we have, in its principal axis coordinate system,
s

p
12 0

= 1 2 area (volume) of the ellipse.
|| =
0 22
Summing the memberships of the observations within one standard deviation of each cluster
center,
n X
K
X
S=
uik , i {i | (xi mk )> F1
k (xi mk ) < 1},
i=1 k=1
the partition density is defined as

PD = S/F HV.
(9.29)
well-separated clusters of approximately multivariAssuming that the data consist of K
ate normally distributed pixels, the partition density should exhibit a maximum at K = K.
An ENVI plug-in for EM clustering is given in Appendix D.6.3.
9.3. EM CLUSTERING
9.3.3
87
Including spatial information
The algorithms described thus far make exclusive use of the spectral properties of the individual observations (pixels). Spatial relationships within an image such as large scale,
coherent regions, textures etc. are ignored entirely.
The EM algorithm determines the a posteriori class membership probabilities of each
observation for the classes in question. In this section we describe a post-processing technique
to take account of some of the spatial information implicit in the classified image in order to
improve the original classification. This technique makes use of the vectors of a posteriori
probabilities associated with each classified pixel.
Figure 9.1 shows schematically a single pixel m together with its immediate neighborhood
n, which we take to consist of the four pixels above, below, to the left and to the right of
m. Let its a posteriori probabilities be
Pm (Ck ),
k = 1 . . . M,
M
X
Pm (Ck ) = 1,
k=1
or, more simply, the vector Pm .

Neighborhood n
Pixel m
+
2
1
Figure 9.1: A pixel neighborhood.

A possible misclassification of the pixel m could in principle be corrected by examining
its neighborhood. The neighboring pixels would have in some way to modify Pm such that
the maximal probability corresponds to the true class. We now describe a purely heuristic
but nevertheless intuitively satisfying procedure to do just that, the so-called probabilistic
label relaxation method [JRR99].
Let Qm (Ck ) be a neighborhood function for the mth Pixel, which is supposed to correct
Pm (Ck ) in the above sense, according to the prescription
0
Pm
(Ck ) = P
Pm (Ck )Qm (Ck )

,
0
0
k0 Pm (Ck )Qm (Ck )
k = 1 . . . M,
or, as a vector equation, according to

P0m =
Pm Qm
,
P>
m Qm
(9.30)
where signifies the Hadamard product, which simply means component-by-component

multiplication. The denominator ensures that the result is also a probability, in other words
88
that
M
X
0
Pm
(Ck ) = 1.
k=1
The neighborhood function must somehow reflect the spatial structure of the image. In
order to define it we first postulate a compatibility measure
Pmi (Ck |Cl ),
i = 1 . . . 4,
namely, the conditional probability that pixel m belongs to class Ck , given that the neighboring pixel i, i = 1 . . . 4, belongs to Cl . A small piece of evidence that m should be
classified to Ck would then be
Pmi (Ck |Cl )Pi (Cl ),
i = 1 . . . 4,
that is, the conditional probability that pixel m is in class Ck if neighboring pixel i is in
class Cl , i = 1 . . . 4.
We obtain a Neighborhood function Qm (Ck ) by summing over all pieces of evidence:
1 XX
Pmi (Ck |Cl )Pi (Cl )
4 i=1
4
Qm (Ck ) =
l=1
M
X
(9.31)
Pmn (Ck |Cl )Pn (Cl ),
l=1
where Pn (Cl ) is the average over all four neighborhood pixels:

1X
Pi (Cl ),
4 i=1
4
Pn (Cl ) =
and where Pmn (Ck |Cl ) also corresponds to the average compatibility of pixel m with its
entire neighborhood. We can write (9.31) again as a vector equation,
Qm = Pmn Pn
and (9.30) finally as
P0m =
Pm (Pmn Pn )
.
P>
m Pmn Pn
(9.32)
The matrix of average compatibilities Pmn can be estimated directly from the original
classified image. A random central pixel m is chosen and its calss Ci determined. Then, again
randomly, a pixel j out of its neighborhood its chosen and its class Cj is also determined.
Thereupon the matrix element Pmn (Ci |Cj ) (which was initialized to 0) is incremented by 1.
This is repeated many times and finally the rows of the matrix are normalized.
Equation (10.15) is well-suited for a simple algorithm:
Algorithm (Probabilistic label relaxation)
1. Carry out a supervised classification, e.g. with a FFN, and determine the compatibility matrix Pmn .
9.4. THE KOHONEN SELF ORGANIZING MAP
89
2. Determine the average neighborhood vector Pn of all pixels m and replace Pm

with P0m according to (9.32). Re-classify pixel m according to the largest membership probability in P0m .
3. If only a few re-classifications took place, stop otherwise go to step 2.
The stopping condition in the algorithm is obviously rather arbitrary. Experience shows
that the best results are obtained after 34 iterations, see [JRR99]. Too many iterations lead
to a widening of the effective neighborhood of a pixel to such an extent that fully irrelevant
spatial information falsifies the final product.
The PLR method can be applied similarly to class probabilities generated by supervised
classification algorithms. ENVI plug-ins for probabilistic label relaxation are given
in Appendix D.6.4.
9.4
The Kohonen Self Organizing Map
The Kohonen self organizing map, a simple example of which is sketched in Fig. 9.2 , belongs
to a class of neural networks which are trained by competitive learning, [HKP91, Koh89].
The single layer of neurons can have any geometry, usually one- two- or three-dimensional.
The input signal is represented by the vector
x = (x1 , x2 . . . xN )> .
Each input to a neuron is associated with a synaptic weight, so that for M neurons, the
synaptic weights can represented as a (M N ) matrix
w11
w21
w=
...
w12
w22
..
.
..
.
w1N
w2N
.
..
.
wM 1
wM 2
wM N
The components of the vector wk = (wk1 , wk2 . . . wkN )> are thus the synaptic weights of
the kth neuron.
We interpret the vectors
{x(i)|i = 1 . . . p}.
as training data for the neural network. The synaptic weight vectors are to be adjusted so
as to reflect in some way the clustering of the training data in the N -dimensional feature
space.
When a training vector x is presented to the input of the network, the neuron whose
weight vector wk lies nearest to x is designated to be the winner. Distances are given by
(x wk )> (x wk ).
Call the winner k . Then its weight vector is shifted a small amount in the direction of the
training vector:
wk (i + 1) = wk (i) + (x(i) wk (i)),
where wk (i + 1) is the weight vector after presentation of the ith training vector, see Fig.
9.3. The parameter is called the learning rate of the network.
90

16

k

k

1
2
3
4

OI K 6 3
w

6
6
x1
x2
Figure 9.2: The Kohonen feature map in two dimensions with a two-dimensional input.
The intention is to repeat this learning procedure until the synaptic weight vectors reflect
the class structure of the training data, thus achieving a vector quantization of the feature
space. In order for this method to function, it is necessary to allow the learning rate to
decrease gradually during the training process. A convenient function for this is

i/p
min
(i) = max
.
max
However the Kohonen feature map goes a step further and tries to map the topology of the
feature space onto the network. This is achieved by defining a neighborhood function for the
winner neuron on the network of neurons. Usually a Gauss function of the form
(k , k) = exp(d2 (k , k)/2 2 )
is used, where d2 (k , k) is the square of the distance between neurons k and k. For example,
for a two-dimensional array of m m neurons
d2 (k , k) =[(k 1) mod m (k 1) mod m]2
+ [(k 1) div m (k 1) div m]2 ,
whereas for a cubic m m m array.
d2 (k , k) = [(k 1) mod m (k 1) mod m]2
+ [((k 1) div m (k 1) div m) mod m]2
+ [(k 1) div m2 (k 1) div m2 ]2 .
During the learning phase not only the winner neuron, but also the neurons in its neighborhood are moved in the direction of the training vectors:
wk (i + 1) = wk (i) + (i)(k , k)(x(i) wk (i)), k = 1 . . . M.
9.5. UNSUPERVISED CLASSIFICATION OF CHANGES
91
wk (i)
7
N
>
wk (i + 1)
:
x(i)
Figure 9.3: Movement of synaptic weight vector in the direction of training vector.
Finally, the extent of the neighborhood is allowed to shrink steadily

i/p
min
() = max
.
max
Typically, max m/2 and min 1/2. Thus the neighborhood is initially the entire
network and toward the end of training it is very localized.
For visualization or clustering of multispectral satellite imagery, a cubic network geometry is useful. After training, the image is classified by associating each pixel vector with the
neuron having the closest synaptic weight vector. The pixel is then colored by mapping the
position of the neuron in the cube to coordinates in RGB color space. Thus, pixels that are
close together in feature space are represented by similar colors.
An ENVI plug-in for the Kohonen self organizing map is given in Appendix
D.6.5.
9.5
Unsupervised classification of changes
We mention finally an extension of the procedure used to determine change/no-change decision thresholds discussed in Section 8.4.7. Rather than clustering the MAD change components individually as was done there, we can use any of the algorithms introduced in
this chapter (except the Kohonen SOM) to classify the changes. Because of its ability to
accommodate correlated clusters, we prefer the EM algorithm.
Clustering of the change pixels can of course be applied in the full MAD or MNF/MAD
feature space, where the number of clusters chosen determines the number of change categories. The approximate chi-square distribution of the sum of squares of the standardized
variates allows the labelling of pixels with high no-change probability. These can be excluded from the clustering process e.g. by freezing their a posteriori probabilities to 1
for the no-change class, thereby speeding up the calculation considerably. Routines for
change classification using the EM algorithm are included in the ENVI GUI for
viewing change detection images given in Appendix D.6.6.
92
Chapter 10
Supervised Classification
The pixel-oriented, supervised classification of multispectral images is a problem of probability density estimation. On the basis of representative training data for each class, the
probability distributions for all of the classes are estimated and then used to classify all of
the pixels in the image. We will consider three methods or models for supervised classification: a parametric model (Bayes maximum likelihood), a non-parametric model (Gaussian
kernel) and a mixture model (the feed-forward neural network). The basis for all of these
classifiers is Bayes decision rule, which we consider first.
10.1
Bayes decision rule
The a posteriori probabilities for class Ck , Eq. (2.3), can be written for N -diminsional
training data and M classes in the form
P (Ck |x),
k = 1 . . . M, x = (x1 . . . xN )> .
(10.1)
Let us define a loss function L(Ci , x) which measures the cost of associating the pixel with
feature vector x with the class Ci . Let ik be the loss incurred if x in fact belongs to class
Ci , but is classified as belonging to class Ck . We can reasonably assume

= 0 if i = k
ik
i, k = 1 . . . M,
(10.2)
> 0 otherwise,
that is, a correct classification incurs no loss. We can now express the loss function as a sum
over the individual losses, weighted according to (10.1):
L(Ci , x) =
M
X
ik P (Ck |x).
(10.3)
k=1
Without further specifying ik , we can define a loss-minimizing decision rule for our classification as
x Ci provided L(Ci , x) < L(Cj , x) for all j = 1 . . . M, j 6= i.
(10.4)
Up till now weve been completely general. Now suppose the losses are independent of the
kind of misclassification that occurs (for instance, the classification of a forest pixel into
93
94
CHAPTER 10. SUPERVISED CLASSIFICATION
the the class meadow is just as bad as classifying it as urban area, etc). The we can write
ik = 1 ik , .
Thus any given misclassification (i 6= k) costs unity, and a correct classification (i = k) costs
nothing. We then obtain from (10.3)
L(Ci , x) =
M
X
P (Ck |x) P (Ci |x) = 1 P (Ci |x), i = 1 . . . M,
(10.5)
k=1
and from (10.4) the Bayes decision rule

x Ci if P (Ci |x) > P (Cj |x) for all j = 1 . . . M, j 6= i.
10.2
(10.6)
Training data
The selection of representative training data is the most difficult and critical part of the
classification process. The standard procedure is to select training areas within the image
which are representative of each class of interest. In the ENVI environment, these are
entered as regions of interest (ROIs), from which the training pixel vectors are generated.
Note that some fraction of the representative data must be withheld for later accuracy
assessment. These are the so-called test data, which are not used for training purposes in
order not to bias the accuracy assessment. Well discuss their use in detail in later in this
chapter.
Suppose there are just two classes, that is M = 2. If we apply decision rule (10.6) to
some measured pixel vector x, the probability of incorrectly classifying the pixel is
r(x) = min[P (C1 |x), P (C2 |x)].
The Bayes error is defined to be the average of r(x) over all pixels,
Z
Z
= r(x)p(x)dx = min[P (C1 |x), P (C2 |x)]p(x)dx
Z
= min[P (x|C1 )P (C1 ), P (x|C2 )P (C2 )]dx,
where we used Bayes rule in the last step. We can use the Bayes error as a measure of the
separability of the two classes, the smaller the error, the better the separability.
Calculating the Bayes error is difficult, but we can at least get an approximate upper
bound as follows. First note that, for any a, b 0,
min[a, b] aS b1S ,
0 S 1.
For example, if a < b, then the inequality can be written

1S
b
aa
a
which is clearly true. Applying the inequality to the expression for the Bayes error, we get
the so-called Chernoff bound
Z
u = P (C1 )S P (C2 )1S P (x|C1 )S P (x|C2 )1S dx.
10.3. BAYES MAXIMUM LIKELIHOOD CLASSIFICATION
95
The best upper bound is then determined by minimizing u with respect to S. If we assume
that P (x|C1 ) and P (x|C2 ) are normal distributions with 1 = 2 , then the minimum occurs
at S = 1/2.
We get the Bhattacharyya bound B by using S = 1/2 also for the case where 1 6= 2 :
Z p
p
B = P (C1 )P (C2 )
P (x|C1 )P (x|C2 ) dx.
This integral can be evaluated explicitly. The result is
p
B = P (C1 )P (C2 )eB ,
where B is the Bhattacharyya distance given by
!

,

1
1 + 2 p
1
1 + 2
1
>

(2 1 ) + log
|1 ||2 | .
B = (2 1 )

8
2
2
2
The first term is an average Mahalinobis distance (see below), the second term depends
on the difference between the covariance matrices of the two classes. It vanishes when
1 = 2 . Thus the first term gives the class separability due due the distance between
the class means, while the second term gives the separability due to the difference in the
covariance matrices.
Finally, the Jeffries-Matusita distance measures separability of two classes on a scale
[0 2] in terms of B:
J = 2(1 eB ).
(10.7)
The ENVI menu command
Basic Tools/Region of Interest/Compute ROI Separability
calculates Jeffries-Matusita distances between all pairs of classes defined by a given set of
ROIs.
10.3
Bayes Maximum likelihood classification
Consider again Bayes rule:

P (Ci |x) =
P (x|Ci )P (Ci )
P (x)
where P (Ci ), i = 1 . . . M , are a priori probabilities and where P (x) is given by

P (x) =
M
X
P (x|Cj )P (Cj ).
j=1
Since P (x) is independent of i, we can write the decision rule (10.6) as

x Ci if P (x|Ci )P (Ci ) > P (x|Cj )P (Cj ) for all j = 1 . . . M, j 6= i.
Now we make two simplifying assumptions: first, that all the a priori probabilities are
equal, second, that the measured feature vectors from class Ci have been sampled from a
multivariate normal distribution, that is, that they satisfy

1
1
> 1
P (x|Ci ) =
exp
(x
(x
)
.
(10.8)
i
i
i
2
(2)N/2 |i |1/2
96
According to the first assumption, we only need to associate x to that class Ci which
maximizes P (x|Ci ):
x Ci if P (x|Ci ) > P (x|Cj ) for all j = 1 . . . M, j 6= i.
(10.9)
Taking the logarithm of (10.8) gives

1
N
1
log P (x|Ci ) = log(2) log |i | (x i )> 1
i (x i )
2
2
2
and we can ignore the first term, as it is independent of i. With the definition of a discriminant function di (x),
di (x) = log |i | (x i )> 1
i (x i ),
(10.10)
we obtain finally the Bayes maximum-likelihood classifier:

x Ci if di (x) > dj (x) for all j = 1 . . . M, j 6= i.
(10.11)
The expression (xi )> 1

i (xi ) in (6.10) is referred to as the Mahalanobis distance. The
moments of the distributions for the M classes, i and i , which appear in the discriminant
function (10.10), may be estimated from the training data using the maximum likelihood
estimates:
1 X
i mi =
x
ni
xCi
1 X
i Fi =
(x i )(x i )> ,
ni
xCi
where ni is the number of training pixels in class Ci .

The maximum likelihood classification algorithm can be called from the ENVI main
menu with
Classification/Supervised/Maximum Likelihood.
10.4
Non-parametric methods
In non-parametric density estimation we wish to model the probability distribution generated by a given set of training data, without making any prior assumption about the form of
the distribution function. An example is the class of kernel based methods. Here each data
point is used as the center of a simple local probability density and the overall distribution
is taken to be the sum of the local distributions. In N dimensions, we can model the class
probability distribution as
P (y|Ci )
>
2
1 X
1
e(yx) (yx)/2 .
2
N/2
ni
(2 )
xC
i
The quantity is a smoothing parameter which we can choose for example by minimizing
the misclassifications on the training data themselves with respect to .
The kernel based method suffers from the drawback of requiring all training data points
to be stored. This makes the evaluation of the density very slow if the number of training
pixels is large. In general, the complexity grows with the amount of data, not with the
difficulty of the estimation problem itself.
10.5. NEURAL NETWORKS
10.5
97
Neural networks
Neural networks belong to the category of mixture models for probability density estimation,
which lie somewhere between the parametric and non-parametric extremes. They make no
assumption about the functional form of the probabilities and can be adjusted flexibly to
the complexity of the system that they are being used to model.
To motivate their use for classification, consider two classes C1 and C2 in a two-dimensional
feature sspace. We could write (10.11) in the form of a discriminant
m(x) = d1 (x) d2 (x)
and say that

x is
C1
C2
if m(x) > 0
if m(x) < 0.
A much simpler discriminant is the linear function:

m(x) = w0 w1 x1 + w2 x2 = w0 + w> x,
(10.12)
where w = (w1 , w2 )> and w0 are parameters. The decision boundary occurs for m(x) = 0,
i.e. for
w1
w0
x2 = x1
,
w2
w2
see Figure 10.1
w0
w
2
u
u e u
u
e
u
e
m(x) = 0
e
e e
e
e w1
e
e
e
w2
Figure 10.1: A linear discriminant for two classes.

In N dimensions
m(x) = w0 + w1 x1 + . . . + wN xN = w> x + w0
and we can represent this equation schematically as an artificial neuron as in shown Figure
10.2. The parameters wi are referred to as synaptic weights.
Usually, the output m(x) is modified by a so-called sigmoid activation function, for
example the logistic function
1
g(x) =
,
1 + eI(x)
98
1
x1
xi
xN

- 0

w0
- 1
w1

..
.
wi
- i
wN
..
.

- N

~
m(x)
q
:
>
Figure 10.2: An artificial neuron. The first input is always unity and is called the bias.
where
I(x) = w> x + w0 .
This is sometimes justified by the analogy to biological neurons. In IDL (see Figure 10.3):
thisDevice =!D.Name
set_plot, PS
Device, Filename=c:\temp\logistic.eps,xsize=15,ysize=10,/Encapsulated
x=(findgen(100)-50)/10
plot, x,1/(1+exp(-x))
device,/close_file
set_plot,thisDevice
Figure 10.3: The logistic activation function.
99
There is also a statistical justification, however [Bis95]. Suppose two classes in twodimensional feature space are normally distributed with 1 = 2 = I,
P (x|Ck )
|x k |2
1
exp(
),
2
2
k = 1, 2.
Then we have
P (x|C1 )P (C1 )
P (x|C1 )P (C1 ) + P (x|C2 )P (C2 )
1
=
1 + P (x|C2 )P (C2 )/P (x|C1 )P (C1 )
1
=
.
1
2
1 + exp( 2 [(x 2 ) (x 1 )2 ])(P (C2 )/P (C1 ))
P (C1 |x) =
With the substitution
ea = (P (C2 )/P (C1 ))
we get
1
1+
2 |2 |x 1 |2 ] a)
1
1
=
=
>
1 + exp(w x w0 )
1 + eI(x)
= m(x).
P (C1 |x) =
exp( 12 [|x
Here we made the additional substitutions

w = 1 2
1
1
w0 = |1 |2 + |2 |2 + a.
2
2
Thus we expect that the output of the neuron will not only discriminate between the two
classes, but also that it will approximate the posterior class membership probability P (C1 |x).
10.5.1
The feed-forward network
In order to discriminate any number of classes, multilayer feed-forward networks are often
used, see Figure 10.4. In this figure, the input signal is the N + 1-component vector
x(`) = (1, x1 (`) . . . xN (`))>
for training sample `, which is fed simultaneously to the so-called hidden layer consisting of
L neurons. These in turn determine the L + 1-component vector
n(x) = (1, n1 (x) . . . nL (x))>
according to
nj (x) = g(Ijh (x)),
j = 1 . . . L,
with
Ijh (x) = wjh> x,
where wh> is the hidden weight vector for the jth neuron
h >
wh = (w0h , w1h . . . wL
) .
100
#
-
"!
#
#
1
"!
#
x1 (`)
xN (`)
- 1
*

"!
i
"!
..
.
#
N
"!
n1
#
q
- m1 (`)
1
>
"!
..
.
..
.
"!
..
.
#
xi (`)
Wo
#
~
q
: j
>"!
#
w
nj R
- k
- mk (`)
>
"!
..
.
..
.
#
w
s L
"!
nL
#
U
- mM (`)
RU M
"!
Figure 10.4: A two-layer, feed-forward neural network with L hidden neurons for classification of N -dimensional data into M classes.
In terms of the weight matrices

h
Wh = (w1h , w2h , . . . wL
),
we can write this compactly as

n=
o
Wo = (w1o , w2o , . . . wM
),
1
g(Wh> x)

.
The vector n is then fed to the output layer. If we interpret the outputs as probabilities,
then we must ensure that
0 mk 1, k = 1 . . . M,
and, furthermore, that
M
X
mk = 1.
k=1
This can be done by using a modified logistic activation function for the output neurons,
called softmax:
o
eIk (n)
,
mk (n) = I o (n)
o
o
e1
+ eI2 (n) + . . . + eIM (n)
where
Iko (n) = wko> n.
101
To quote Bishop [Bis95],

... such networks can approximate arbitrarily well any functional continuous mapping from one finite dimensional space to another, provided the number
L of hidden units is sufficiently large. An important corollary of this result is,
that in the context of a classification problem, networks with sigmoidal nonlinearities and two layers of weights can approximate any decision boundary to
arbitrary accuracy. More generally, the capability of such networks to approximate general smooth functions allows them to model posterior probabilities of
class membership.
10.5.2
Cost functions
We havent yet considered the correct choice of synaptic weights. This procedure is called
training the network. The training data can be represented as the set of labelled pairs
{(x(`), y(`)) | ` = 1 . . . p},
where
y(`) = (0, 0 . . . 0, 1, 0 . . . 0)>
is an M -dimensional vector of zeroes, with a 1 at the kth position to indicate that x(`)
belongs to class Ck . An intuitive training criterion is then the quadratic cost function
1X
ky(`) m(`)k2 .
2
p
E(Wh , Wo ) =
(10.13)
`=1
We must adjust the network weights so as to minimize E. Equivalently we can minimize

the local cost functions
E(`) :=
1
ky(`) m(`)k2 ,
2
` = 1 . . . p.
(10.14)
An alternative cost function can be obtained with the following argument: Choose the
synaptic weights so as to maximize the probability of observing the training data:
P (x(`), y(`)) = P (y(`) | x(`))P (x(`)) max .
The neural network predicts the posterior class membership probability, which we can write
as
M
Y
P (y(`) | x(`)) =
[ mk (x(`)) ]yk (`) .
k=1
For example:
P ((1, 0 . . . 0)> |x) = m1 (x)1 m2 (x)0 smM (x)0 = m1 (x).
Therefore we wish to maximize
M
Y
k=1
[ mk (x(i)) ]yk (i) P (x(`))
102
Taking logarithms, dropping terms which are independent of the synaptic weights and summing over all of the training data, we see that this is equivalent to minimizing the cross
entropy cost function
E(W , W ) =
h
p X
M
X
yk (`) log mk (x(`))
(10.15)
`=1 k=1
with respect to the synaptic weight parameters.
10.5.3
Training
Let w be the vector of all synaptic weights, i.e.

E(Wh , Wo ) =: E(w)
In one dimension, expanding in a Taylor series about a local minimum w ,
E(w) = E(w ) + (w w )
dE(w ) 1
d2 E(w )
+ (w w )2
+ ...
dw
2
dw2
1
= E0 + (w w )2 H + . . . ,
2
where H =
d2 E(w )
dw2
and we must have H > 0 for a minimum, see Figure 10.5

E(w)
dE(w )
dw
=0
w
Figure 10.5: Minimization of the cost function in one dimension.

In many dimensions, we get the analogous expression
E(w) = E0 + (w w )> H(w w ) + . . .
where the matrix H is called the Hessian,
Hij =
2 E(w )
.
wi wj
(10.16)
103
It is symmetric and it must be positive definite for a local minimum. It is positive definite
if all of its eigenvalues are positive, see Appendix C.
A local minimum can be found with various search algorithms. Backpropagation is the
most well-known and extensively used method and is described below. It is used in the standard ENVI neural network for supervised classification. However much better algorithms
exist, such as scaled conjugate gradient or Kalman filter. These are discussed in detail in
Appendix C. ENVI plug-ins for supervised classification with a feed forward neural network trained with conjugate gradient and a fast Kalman filter algorithm
are given in Appendices D.7 and D.8.
10.5.4
Backpropagation
We will develop a training algorithm for the two-layer, feed-forward neural network of Figure
10.4. Our starting point is the local version of the cost function (10.15),
E(`) =
M
X
yk (`) log mk (`),
` = 1 . . . p,
(10.17)
k=1
or, in vector form
E(`) = y> (`) log m(`),
which we wish to minimize with respect to the synaptic weights represented by the (N +1)L
h
o
matrix Wh = (w1h , w2h , . . . wL
) and the (L + 1) M matrix Wo = (w1o , w2o , . . . wM
).
The following IDL object class FFN mirrors the network architecture of Figure 10.4 and
will form the basis for the implementation of the training algorithms developed here and in
Appendix C:
;+
; NAME:
;
FFN__DEFINE
; PURPOSE:
;
Object class for implementation of a two-layer, feed-forward
;
neural network for classification of multi-spectral images.
;
This is a generic class with no training methods.
;
Ref: M. Canty, Fernerkundung mit neuronalen Netzen, Expert 1999
; AUTHOR
;
Mort Canty (2005)
;
Juelich Research Center
;
; CALLING SEQUENCE:
ffn = Obj_New("FFN",Xs,Ys,L)
;
; ARGUMENTS:
;
Xs: array of observation column vectors
;
Ys: array of class label column vectors of form (0,0,1,0,0,...0)^T
;
L:
number of hidden neurons
; KEYWORDS
;
None
; METHODS (external):
;
FORWARDPASS: propagate a biased input column vector through the network
104
;
returns the softmax probabilities vector
;
m = ffn -> ForwardPass()
;
CLASS: return the class for an for an array of observation column vectors X
;
return the class probabilities in array variable PROBS
;
c = ffn -> Class(X,Probs)
COST: return the current cross entropy
;
;
c = ffn -> Cost()
; DEPENDENCIES:
;
None
;--------------------------------------------------------------
Function FFN::Init, Xs, Ys, L

catch, theError
if theError ne 0 then begin
catch, /cancel
ok = dialog_message(!Error_State.Msg + Returning..., /error)
return, 0
endif
; network architecture
self.LL = L
self.p = n_elements(Xs[*,0])
self.NN = n_elements(Xs[0,*])
self.MM = n_elements(Ys[0,*])
; biased output vector from hidden layer (column vector)
self.N= ptr_new(fltarr(L+1))
; biased exemplars (column vectors)
self.Xs = ptr_new([[fltarr(self.p)+1],[Xs]])
self.Ys = ptr_new(Ys)
; weight matrices (each column is a neuron weight vector)
self.Wh = ptr_new(randomu(seed,L,self.NN+1)-0.5)
self.Wo = ptr_new(randomu(seed,self.MM,L+1)-0.5)
return,1
End
Pro FFN::Cleanup
ptr_free, self.Xs
ptr_free, self.Ys
ptr_free, self.Wh
ptr_free, self.Wo
ptr_free, self.N
End
Function FFN::forwardPass, x
; logistic activation for hidden neurons, N set as side effect
*self.N = [[1],[1/(1+exp(-transpose(*self.Wh)##x))]]
; softmax activation for output neurons
I = transpose(*self.Wo)##*self.N
A = exp(I-max(I))
return, A/total(A)
105
End
Function FFN:: class, X, Probs
; vectorized class membership probabilities
nx = n_elements(X[*,0])
Ones = fltarr(nx) + 1.0
N = [[Ones],[1/(1+exp(-transpose(*self.Wh)##[[Ones],[X]]))]]
Io = transpose(*self.Wo)##N
maxIo = max(Io,dimension=2)
for k=0,self.MM-1 do Io[*,k]=Io[*,k]-maxIo
A = exp(Io)
sum = total(A,2)
Probs = fltarr(nx,self.MM)
for k=0,self.MM-1 do Probs[*,k] = A[*,k]/sum
; vectorized class memberships
maxM = max(Probs,dimension=2)
M=fltarr(self.MM,nx)
for i=0,self.MM-1 do M[i,*]=Probs[*,i]-maxM
return, byte((where(M eq 0.0) mod self.MM)+1)
End
Function FFN:: cost
Ones = fltarr(self.p) + 1.0
N = [[Ones],[1/(1+exp(-transpose(*self.Wh)##[*self.Xs]))]]
Io = transpose(*self.Wo)##N
A = exp(Io)
sum = total(A,2)
Ms = fltarr(self.p,self.MM)
for k=0,self.MM-1 do Ms[*,k] = A[*,k]/sum
return, -total((*self.Ys)*alog(Ms))
End
Pro FFN__Define
struct = { FFN, $
NN: 0L,
LL: 0L,
MM: 0L,
Wh:ptr_new(),
Wo:ptr_new(),
Xs:ptr_new(),
Ys:ptr_new(),
N:ptr_new(),
p: 0L
}
End
$
$
$
$
$
$
$
$
$
;input dimension
;number of hidden units
;output dimension
;hidden weights
;output weights
;training pairs
;output vector from hidden layer
;number of training pairs
106

Consider the following algorithm:
Algorithm (Backpropagation or Generalized Least Mean Square)

1. Initialize the synaptic weights with random numbers and set ` = 1.
2. Choose training pair (x(`), y(`)) and determine the output response m(`) of the network.
o
o
3. For k = 1 . . . M and j = 0 . . . L replace wjk
with wjk
E(`)
wo .
jk
4. For j = 1 . . . L and i = 0 . . . N replace
h
wij
with
h
wij
E(`)
h .
wij
5. If E(Wh , Wo ) is sufficiently small, stop, otherwise set ` = ` mod p + 1 and go to 2.

Thus we keep cycling through the training data, reducing the local cost function at each step
by changing each synaptic weight by an amount proportional to the negative slope of the
cost function with respect to that weight parameter, stopping when the overall cost function
(10.15) is small enough. The constant of proportionality is referred to as the learning rate
for the network. This algorithm makes use only of the first derivatives of the cost function
with respect to the synaptic weight parameters and is referred to as the backpropagation
method.
In order to implement this procedure, we require the partial derivatives of E(`) with
respect to the synaptic weights. Let us begin with the output neurons, for which we have
the softmax output signals
o
eIk (`)
mk (`) = I o (`)
,
o
o
e 1 + eI2 (`) + . . . + eIM (`)
(10.18)
where the activation of the kth neuron is

Iko (`) = wko> n(`).
We wish to determine
E(`)
o ,
wjk
j = 0 . . . L, k = 1 . . . M.
Applying the chain rule

E(`) Iko (`)
E(`)
=
= ko (`)n(`),
wko
Iko (`) wko
k = 1 . . . M,
(10.19)
where the quantity ko (`) is defined as

ko (`) =
E(`)
Iko (`)
and is the negative rate of change of the local cost function with respect to the activation
of the kth output neuron. Again applying the chain rule and with (10.16) and (10.18),
M
X
E(`)
E(`) mk0 (`)
=
o
Iko (`)
m
k0 (`) Ik (`)
0
k =1
M
X
k0 =1
yk0 (`)
mk0 (`)
eIk0 (`) eIk (`)

eIk (`) kk0
PM
PM
o
Iko00 (`)
( k00 =1 eIk00 (`) )2
k00 =1 e
!
.
107
Here, kk0 is the Kronecker symbol

n
kk0 =
0 if k =
6 k0
1 if k = k 0 .
Continuing,
M
M
X
X
yk0 (`)
E(`)
0
0
=
m
(`)(
m
(`))
=
y
(`)
+
m
(`)
yk0 (`).
k
kk
k
k
k
Iko (`)
mk0 (`)
0
0
k =1
k =1
But the sum over M is just one, and we have

E(`)
= (yk (`) mk (`) = ko (`),
Iko (`)
o
and therefore, with o (`) = (1o (`), . . . M
(`))> ,
o (`) = y(`) m(`).
(10.20)
Thus from (10.19) the third step in the backpropagation algorithm can be written in the
form
Wo (` + 1) Wo (`) + n(`) o> (`).
(10.21)
Note that the second term on the right hand side of (10.21) is an outer product, giving a
matrix of dimension (L + 1) M matching that of Wo .
For the hidden weights, step 4 of the algorithm, we proceed similarly:
E(`) Ijh (`)
E(`)
=
= jh (`)x(`),
h
Wj
Ijh (`) Wjh
j = 1 . . . L,
(10.22)
where jh (`) is the negative rate of change of the local cost function with respect to the
activation of the jth hidden neuron:
jh (`) =
E(`)
.
Ijh (`)
Applying again the chain rule:

jh (`) =
M
M
M
X
E(`) Iko (`) X o Iko (`) X o wko> n(`)
=
(`)
=
k (`)
.
k
Ik0 (`) Ijh (`)
Ijh (`)
Ijh (`)
k=1
k=1
k=1
Since only the output of the jth hidden neuron is a function of Ijh (`) = wjh> x(`), we have
jh (`) =
M
X
o
ko (`)wjk
k=1
nj (`)
.
Ijh (`)
The hidden units use the logistic activation function

nj (Ijh ) =
1
h
1 + eIj
108
for which
dnj
= n(x)(1 n(x)).
dx
Therefore we can write

jh (`) =
M
X
o
ko (`)wjk
nj (`)(1 nj (`)),
k=1
or, more compactly,

jh (`) = (wjo o (`)) nj (`)(1 nj (`)).
More compactly still, we can write

0
= n(`) (1 n(`)) Wo o (`) .
h (`)
(10.23)
Note that the fact that 1 n0 (`) = 0 is made explicit in the above expression. Equation
(10.23) is the origin of the term backpropagation, since it propagates the output error o
backwards through the network to determine the hidden unit error h .
Finally, with (10.22) we obtain the update rule for step 4 of the backpropagation algorithm,
Wh (` + 1) Wh (`) + x(`) h> (`).
(10.24)
The choice of an appropriate learning rate is problematic: small values imply slow
convergence and large values produce oscillation. Some improvement can be achieved with
an additional parameter called momentum. We replace (10.21) with
Wo (` + 1) := Wo (`) + o (`) + o (` 1),
(10.25)
where
o (`) = n(`) o> (`),
and is the momentum parameter. A similar expression replaces (10.24). Typical choices
for the backpropagation parameters are = 0.01 and = 0.5.
Here is an object class extending FFN which implements backpropagation:
;+
; NAME:
;
FFNBP__DEFINE
; PURPOSE:
;
;
;
Implements ordinary backpropagation training.
;
Extends the class FFN
;
; AUTHOR
;
Mort Canty (2005)
;
;
109
; CALLING SEQUENCE:
;
ffn = Obj_New("FFNBP",Xs,Ys,L)
; ARGUMENTS:
;
;
Xs: array of class label column vectors of form (0,0,1,0,0,...0)^T
L:
;
; KEYWORDS
;
None
; METHODS:
;
TRAIN: train the network
;
ffn -> train
; DEPENDENCIES:
;
FFN__DEFINE
;
PROGRESSBAR (FSC_COLOR)
;-------------------------------------------------------------Function FFNBP::Init, Xs, Ys, L
catch, theError
if theError ne 0 then begin
catch, /cancel
ok = dialog_message(!Error_State.Msg + Returning..., /error)
return, 0
endif
; initialize the superclass
if not self->FFN::Init(Xs, Ys, L) then return, 0
self.iterations = 10*self.p
self.cost_array = ptr_new(fltarr((self.iterations+100)/100))
return, 1
End
Pro FFNBP::Cleanup
ptr_free, self.cost_array
self->FFN::Cleanup
End
Pro FFNBP::Train
iter = 0L
iter100 = 0L
eta = 0.01
; learn rate
alpha = 0.5 ; momentum
title=Training: exemplar number...,xsize=250,ysize=20)
progressbar->start
window,12,xsize=400,ysize=400,title=Cost Function
wset,12
inc_o1 = 0
inc_h1 = 0
repeat begin
print,Training interrupted
110
return
endif
; select exemplar pair at random
ell = long(self.p*randomu(seed))
x=(*self.Xs)[ell,*]
y=(*self.Ys)[ell,*]
; send it through the network
m=self->forwardPass(x)
; determine the deltas
d_o = y - m
d_h = (*self.N*(1-*self.N)*(*self.Wo##d_o))[1:self.LL] ; d_h is now a row vector
; update the synaptic weights
inc_o = eta*(*self.N##transpose(d_o))
inc_h = eta*(x##d_h)
*self.Wo = *self.Wo + inc_o + alpha*inc_o1
*self.Wh = *self.Wh + inc_h + alpha*inc_h1
inc_o1 = inc_o
inc_h1 = inc_h
; record cost history
if iter mod 100 eq 0 then begin
(*self.cost_array)[iter100]=alog10(self->cost())
iter100 = iter100+1
progressbar->Update,iter*100/self.iterations,text=strtrim(iter,2)
plot,*self.cost_array,xrange=[0,iter100],color=0,background=FFFFFFXL,$
xtitle=Iterations/100),ytitle=log(cross entropy)
end
iter=iter+1
endrep until iter eq self.iterations
progressbar->destroy
End
Pro FFNBP__Define
struct = { FFNBP, $
cost_array: ptr_new(), $
iterations: 0L, $
Inherits FFN $
}
End
In the Train method, the training pairs are chosen at random, rather than cyclically as
indicated in the backpropagation Algorithm.
10.6
Evaluation
The rate of misclassification offers us a reasonable and obvious basis not only for evaluating
the quality of classifiers, but also for their comparison, for example to compare the feedforward network with Bayes maximum-likelihood. We shall characterize this rate in the
following with the parameter . Through classification of test data which have not been
10.6. EVALUATION
111
used for training, we can obtain unbiased estimates of . If, for n test data, y are found to
have been misclassified, then an intuitive value for this estimate is
=: .
n
(10.26)
However the estimated misclassification rates alone are insufficient for model comparison.
We require their uncertainties as well.
10.6.1
Standard deviation of misclassification
The classification of a single test datum is a random experiment, whose possible result we
A}: A=
misclassified, A = correctly classified. We define a
can characterize as the set {A,
real-valued function on this set, i.e. a random variable
= 1,
X(A)
X(A) = 0,
(10.27)
with probabilities
P (X = 1) = = 1 P (X = 0).
The expectation value of this random variable is
hXi = 1 + 0(1 ) =
(10.28)
var(X) = hX 2 i hXi2 = 12 + 02 (1 ) 2 = (1 ).
(10.29)
and its variance is
For the classification of n test data, denoted by random variables X1 . . . Xn , the random
variable
(10.30)
Y = X1 + X2 + . . . Xn
is clearly the associated number of misclassifications. Since
hY i = hX1 i + . . . + hXn i = n
we obtain
y
1
= hY i =
n
n
as an unbiased estimate of the rate of misclassifications.
From the independence of the Xi , i = 1 . . . n, the variance of Y is given by
(10.31)
var(Y ) = var(X1 ) + . . . + var(Xn ) = n(1 ),

and the variance of the misclassification rate is
2 2
Y
1
1
Y
Y
var
= 2 (hY 2 i hY i2 ) = 2 var(Y ),
=
2
n
n
n
n
n

or
var
Y
n

=
(1 )
.
n
(10.32)
For y observed misclassifications we estimate with (10.31). Then the estimated variance
is given by

y
y
)
Y
y(n y)
(1
n 1 n
var
=
=
,
n
n
n
n3
112
and the estimated standard deviation by

r
y(n y)
.
n3
(10.33)
The random variable Y is binomially distributed. However for a sufficiently large number
n of test data, the binomial distribution is well-approximated by the normal distribution.
Mean and standard deviation are then sufficient to characterize the distribution function
completely.
10.6.2
Model comparison
A typical value for a misclassification rate is around 0.5. In order to claim that two
values differ from one another significantly, they should lie at least about two standard
deviations apart. If we wish to discriminate values separated by say 0.01, then
should be
no greater than 0.005. From (10.32) this means
0.0052
0.05(1 0.05)
,
n
or n 2000. Thats quite a few. However since we are dealing with pixel data, such a
number of test pixels assuming sufficient training areas are available is quite realistic.
If training and test data are in fact at a premium, there exist efficient alternatives1 to the
simple train-and-test philosophy presented here. However, since they are generally quite
computer-intensive, we wont consider them further.
In order to express the claim that classifier A is better than classifier B more precisely, we
can formulate an hypothesis test. The individual misclassification rates are approximately
normally distributed. If they are also independent we can construct a test statistic S given
by
YA /n YB /n + A B
YA /n YB /n + A B
S= p
=p
.
var(YA /n YB /n)
var(YA /n) + var(YB /n)
We can then use S to decide between the null hypothesis
H0 : A = B ,
i.e., the two classifiers are equivalent,
and the alternative hypothesis

H1 : A > B or A < B ,
i.e. one of the two methods is better.
Thus under H0 we have S N (0, 1). We choose a decision threshold Z/2 which corresponds to a probability of an error of the first kind. With this probability the null
hypothesis will be rejected although it is in fact true, see Figure 10.6.
In fact the strict independence of the misclassification rates A and B is not given, since
they are determined with the same set of test data. The above hypothesis test with the
statistic S is therefore too conservative. For dependence we have namely
var(YA /n YB /n) = var(YA /n) + var(YB /n) 2cov(YA /n, YB /n),
1 The buzz-words here are Cross-Validation and Bootstrapping, see [WK91], Chapter 2, for an excellent
introduction.
10.6. EVALUATION
113
(S)
acceptance region
Z/2

Z/2
w
/2
/2

Figure 10.6: Acceptance region for the first hypothesis test. If Z/2 S Z/2 , the null
hypothesis is accepted, otherwise it is rejected.
where the covariance term cov(YA /n, YB /n) is positive. The test statistic S is correspondingly underestimated.
We can formulate a non-parametric hypothesis test which avoids this problem of dependence. We distinguish the following events for classification of the test data:
AB
und AB.
AB,
AB,
is the event test observation is misclassified by A and correctly classified
The variable AB
is the event test observation is correctly classified by A and misclassified by
by B, while AB
B and so on. As before we define random variables:
XAB
, XA
, XA B
B
and XAB
where
= 1,
XAB
(AB)
= XAB
B)
= XAB
XAB
(AB)
(A
(AB) = 0,
with probabilities
P (XAB
= 1) = AB
= 1 P (XAB
= 0).
Corresponding definitions are made for XAB , XAB and XAB .
and AB.
If
Now, in comparing the two classifiers we are interested in the events AB
the number of former is significantly smaller than the number of the latter, then A is better
in which both methods perform poorly are excluded.
than B and vice versa. Events AB
For n test observations the random variables
YAB
= XAB
1 + . . . XAB
n
and
YAB = XAB 1 + . . . XAB n

are the frequencies of the respective events. We then have
hYAB
i = nAB
,
var(YAB
) = nAB
(1 AB
)
hYAB i = nAB ,
var(YAB ) = nAB (1 AB ).
114
We expect that AB
1, that is, var(YAB
) nAB
= hYAB
i. The same goes for YAB
. For
a sufficiently large number of test observationss, the random variables
YAB
hYAB
i
p
hYAB
i
and
YAB hYAB i
p
hYAB i
are thus approximately standard normally distributed.

Under the null hypothesis (equivalence of the two classifiers), the expectation values of
YAB
satisfy
and YAB
hYAB
i =: hY i.
i = hYAB
Therefore we form the test statistic
S=
2
(YAB
(Y hY i)2
hY i)
+ AB
.
hY i
hY i
This statistic, being the sum squares of approximately normally distributed random variables, is chi-square distributed, see Chapter 2.
Let yAB
and yAB
be the number of events actually measured. Then we estimate hY i as
y + yAB
hY i = AB
2
and determine our test statistic as
yAB
+yAB
2
+yAB
2
(yAB
)
(yAB yAB
)

2
2
S =
+
.
yAB
+y
y
+y
AB
AB
AB
2
With a little algebra we get
(y yAB )2
,
S = AB
yAB
+ y AB
(10.34)
the so-called McNemar statistic. It is chi-square distributed with one degree of freedom,
see for example [Sie65]. A so-called continuity correction is usually made to (10.34) and S
written as
2
(|yAB
| 1)
y AB
S =
.
yAB
+ y AB
But there are still reservations! We can only conclude that one classifier is or is not
superior, relative to the common set of training data. We havent taken into account the
variability of the training data, which were sampled just once from their underlying distributions, only that of the test data. If one or both of the classifiers is a neural network, we
have also not considered the variability of the neural network training procedure with respect to the random initialization of the synaptic weights. All this constitutes an extremely
computation-intensive task [Rip96].
10.6.3
Confusion matrices
The confusion matrix for M classes is defined as
c11 c12
c21 c22
C=
..
...
.
cM 1
cM 2
s
s
..
.
c1M
c2M
..
.
cM M
10.6. EVALUATION
115
where cij is the number of test pixels from class Ci which are classified as Cj . The misclassification rate is
PM
n i=1 cii
y
n Tr C
= =
=
n
n
n
and only takes into account of the diagonal elements of the confusion matrix.
The Kappa-coefficient make use of all the matrix elements. It is defined as follows:
=
correct classifications chance correct classifications

1 chance correct classifications
For a purely randomly labeled test pixel, the proportion of correct classifications is approximately
M
X
ci ci
,
n2
i=1
where
ci =
M
X
cij ,
ci =
j=1
M
X
cji .
j=1
Hence an estimate of the Kappa coefficient is

P cii P ci ci
= i n P ciici n .
1 i n2
(10.35)
Again, the Kappa coefficient alone tells us little about the quality of the classifier. We
require its standard deviation. This can be calculated in the large sample limit n to
be [BFH75]
!
1 1 (1 1 ) 2(1 1 )(21 2 3 ) (1 1 )2 (4 422 )
=
+
+
,
(10.36)
n (1 2 )2
(1 3 )3
(1 2 )4
where
1 =
M
X
cii
i=1
2 =
M
X
ci ci
i=1
3 =
M
X
cii (ci + ci )
i=1
4 =
M
X
i,j=1
cij (cj + ci )2 .
116
Chapter 11
Hyperspectral analysis
Hyperspectral as opposed to multispectral images combine both high or moderate spatial
resolution with high spectral resolution. Typical sensors (imaging spectrometers) generate
in excess of two hundred spectral channels. Figure 11.1 shows part of a so-called image
cube for the AVIRIS (Airborne Visible/Infrared Imaging Spectrometer) sensor taken over
a region of the Californian coast. Sensors of this kind produce much more complex data
and provide correspondingly much more information about the reflecting surfaces examined.
Figure 11.2 displays the spectrum of a single pixel in the image.
Figure 11.1: AVIRIS hyperspectral image cube, Santa Monica Mountains.
117
118
CHAPTER 11. HYPERSPECTRAL ANALYSIS
Figure 11.2: AVIRIS spectrum of one pixel location.
11.1
Mixture modelling
In working with multispectral images, the fact that at the scale of observation a pixel contains
a mixture of materials is generally treated as a second order effect and more or less ignored.
With the availability of high spectral resolution sensors it has become possible to treat the
problem of the mixed pixel quantitatively.
The basic premise of mixture modelling is that within a given scene, the surface is
dominated by a small number of common materials that have relatively constant spectral
properties. These are referred to as the end-members. It is assumed that the spectral
variability captured by the remote sensing system can be modelled by mixtures of these
components.
11.1.1
Full linear unmixing
Suppose that there are p end-members and ` spectral bands. Then we can denote the
spectrum of the ith end-member by the vector
i
m1
..
i
m = . .
mi`
Now define the matrix of end-members M according
1
m1
..
1
p
M = (m . . . m ) = .
m1`
to
s
..
.
mp1
.. ,
.
mp`
with one column for each end-member. For hyperspectral imagery we always have p `.
11.1. MIXTURE MODELLING
119
The measured signal g is modelled as a linear combination of end-members plus a residual

noise term:
g = 1 m1 + . . . + p mp + n = M + n.
The residual n is assumed to be normally distributed with covariance matrix
2 0
s
0
1
0
0 22 s
n =
. .
.. . .
..
. ..
.
.
0
0
s `2
n and the square of the standardized residual is
The standardized residual is 1/2
n
(1/2
n)> (1/2
n) = n> 1
n
n
n n.
The mixing coefficients are determined my minimizing this quantity with respect to under
the condition that they sum to unity. The corresponding Lagrange function is
p
X
L = n> 1
n
2(
i 1)
n
i=1
= (g
M)> 1
n (g
p
X
M) 2(
i 1)
i=1
Solving the set of equations

L
=0
L
=0
we obtain the solution

1
= (M> 1
(M> 1
n M)
n g 1p )
1p = 1,
(11.1)
where 1p = (1, 1 . . . 1)> . The first equation determines the mixing coefficients in terms of
known quantities and . The second equation can be used to eliminate .
11.1.2
Unconstrained linear unmixing
If we work with MNF-projected data (see next section) then we can assume that n = 2 I.
If furthermore we ignore the constraint on (i.e. = 0), then (11.1) reduces to
= [(M> M)1 M> ]g.
The expression in square brackets is the pseudoinverse of the matrix M, see Chapter 1.
11.1.3
Intrinsic end-members and pixel purity
If a spectral library for all of the p end-members in M is available, the mixture coefficients
can be calculated directly. The primary result of the spectral mixture analysis is the fraction
120
images which show the spatial distribution and abundance of the end-member components
in the scene.
If such external data are unavailable, there are various strategies for determining endmembers from the hyperspectral imagery itself. We describe briefly the method recommended in ENVI and implemented in the so-called Spectral Hourglass Wizard.
The first step is to reduce the dimensionality of the data. This is done with the MNF
transformation described in Chapter 3. By examining the eigenvalues of the transformation
and retaining only the components with eigenvalues exceeding one (non-noise components),
the number of dimensions can be reduced substantially, see Figure 11.3.
Figure 11.3: Eigenvalues of the MNF transformation of the image in Figure 11.1.
The so-called pixel purity index (PPI) is then used to find the most spectrally pure, or
extreme, pixels in the remaining data. The most spectrally pure pixels typically correspond
to mixing end-members. The PPI is computed by repeatedly projecting n-dimensional
scatter plots onto a random unit vector. The extreme pixels in each projection are noted
and the number of times each pixel is marked as extreme is recorded. The purest pixels must
must be on the corners, edges or faces of the data cloud. A threshold value is used to define
how many pixels are marked as extreme at the ends of the projected vector. This value
should be 2-3 times the noise level in the data, which is 1 when using the MNF transformed
channels. A minimum of about 5000 iterations is usually required to produce useful results.
When the iterations are completed, a PPI image is created in which the value of each
pixel corresponds to the number of times that pixel was recorded as extreme. So bright
pixels are generally end-members. This image hints at locations and sites that could be
visited for ground truth measurements.
The n-dimensional visualizer, Figure 11.4 can then be used interactively to define classes
of pixels corresponding to end-members and to plot their spectra. These can be saved along
with their pixel locations as ROIs (regions of interest) for later use in spectral unmixing.
This method is repeatable and has the advantage of objectivity in analysis of a data
set to assess dimensionality and define end-members. The primary disadvantage is that it
is a statistical approach dependent upon the specific spectral variance of the image. Thus
the resulting end-members are mathematical constructs which may not be physically interpretable.
11.2. ORTHOGONAL SUBSPACE PROJECTION
121
Figure 11.4: The n-D visualizer.
11.2
Orthogonal subspace projection
Orthogonal subspace projection is a transformation which is closely related to linear unmixing. Suppose that a multispectral image pixel g consists of a mixture of desirable and
undesirable spectra,
g = D + U + n.
The ` ` matrix
P = I U(U> U)1 U>
projects out the undesirable components, since

Pg = PD + IU U + Pn = PD + Pn.
(11.2)
An example of the use of this transformation is the suppression of cloud cover from a
multispectral image. First an unsupervised classification is carried out (see Chapter 9) and
the clusters containing the undesired features (clouds) are identified. The mean vectors of
these clusters can then be used as the undesired spectra and combined to form the matrix
U. The the projection (11.2) can be applied to the entire image.
Here is an ENVI/IDL program to implement this idea:
; Orthogonal subspace projection
pro osp, event
print, ---------------------------------
print, Orthogonal Subspace Projection
print, systime(0)
122
print, ---------------------------------
infile=dialog_pickfile(filter=*.dat,/read) ; read in cluster centers
openr,lun,infile,/get_lun
; number of spectral channels
readf,lun,num_channels
readf,lun,K
; number of cluster centers
Ms=fltarr(num_channels,K)
readf,lun,Ms
Us=transpose(Ms)
print,Cluster centers (in the columns)
print,Us
centers=indgen(K)
print,enter undesired centers as 1 (e.g. 0 1 1 0 0 ...)
read,centers
U = Us[where(centers),*]
print,Subspace U
print,U
Identity = fltarr(num_channels,num_channels)
for i=0,num_channels-1 do Identity[i,i]=1.0
P = Identity - U##invert(transpose(U)##U,/double)##transpose(U)
print,projection matrix:
print, P
envi_select, title=Choose multispectral image for projection, $
fid=fid, dims=dims,pos=pos
if (fid eq -1) then goto, done
num_cols = dims[2]+1
num_lines = dims[4]+1
num_pixels = (num_cols*num_lines)
if (num_channels ne n_elements(pos)) then begin
print,image dimensions are incorrect, aborting ...
goto, done
end
image=fltarr(num_pixels,num_channels)
for i=0,num_channels-1 do $
image[*,i]=envi_get_data(fid=fid,dims=dims,pos=pos[i])+0.0
print,projecting ...
; do the projection
image = P ## image
out_array = bytarr(num_cols,num_lines,num_channels)
for i = 0,num_channels-1 do out_array[*,*,i] = $
bytscl(reform(image[*,i],num_cols,num_lines,/overwrite))
base = widget_auto_base(title=OSP Output)
11.2. ORTHOGONAL SUBSPACE PROJECTION

sb = widget_base(base, /row, /frame)
wp = widget_outfm(sb, uvalue=outf, /auto)
result = auto_wid_mng(base)
if (result.accept eq 0) then begin
print, Output cancelled
goto, done
endif
if (result.outf.in_memory eq 1) then begin
envi_enter_data, out_array
print, Result written to memory
endif else begin
openw, unit, result.outf.name, /get_lun
band_names=strarr(num_channels)
for i=0,num_channels-1 do begin
writeu, unit, out_array[*,*,i]
band_names[i]=OSP component + string(i+1)
endfor
envi_setup_head ,fname=result.outf.name, ns=num_samples, $
nl=num_lines, nb=num_channels $
,data_type=1, interleave=0, /write $
,bnames=band_names $
,descrip=OSP
print, File created , result.outf.name
close, unit
endelse
done: print,done.
end
123
124
Appendix A
Least Squares Procedures

A.1
Generalized least squares
Consider the following data model:

y = a> x +
relating n independent variables x = (x1 . . . xn )> to a measured quantity y via the parameters a = (a1 . . . an )> . The random variable represents measurement uncertainty, and we
assume
var() = 2 .
We wish to determine the best values for parameters a. If we perform m > n measurements, we can write
n
X
aj (xj )1 +
y1 =
j=1
..
.
(A.1)
ym =
n
X
aj (xj )m + .
j=1
Defining the m n matrix A by

(A)ij = (xj )i
we can write (A.1) as
y = Aa +
(A.2)
where y = (y1 . . . ym )> and = ( . . . )> and

= h> i = 2 I.
The goodness of fit function is
2
"
#2
Pn
m
X
yi j=1 Aij aj
i=1
125
126
APPENDIX A. LEAST SQUARES PROCEDURES
This is minimized by solving the equations

2
= 0,
ak
We obtain
m
X
i=1
y i
n
X
k = 1 . . . n.
Aij aj Aik = 0,
k = 1 . . . n,
j=1
which we can write in matrix form as

A> y = (A> A)a .
(A.3)
Eq. (A.3) is referred to as the normal equation. The fitted parameters of the model are thus
estimated by
= (A> A)1 A> y =: Ly.
a
(A.4)
The matrix
L = (A> A)1 A>
is called the pseudoinverse of A.

, the uncertainties in
Thinking now of a as a random variable with expectation value a
the fitted parameters can be obtained as follows:
)(a a
)> i
= h(a a
= h(a Ly)(a Ly)> i
= h(a L(Aa + ))(a L(Aa + ))> i
But LA = I, so we have
= h(L)(L)> i
= Lh> iL>
(A.5)
= 2 LL>
= 2 (A> A)1 .
To check that this is indeed a generalization of the simple linear regression, identify the
parameter vector a with the straight line parameters a and b, i.e.

a1
a
a=
=
.
a2
b
The matrix A and vector y are similarly
1
1
A=
...
1
x1
x2
,
..
.
xm
y1
y2
y=
.. .
.
ym
Thus the best estimates for the parameters are

a
= = (A> A)1 (A> y).

a
b
A.2. RECURSIVE LEAST SQUARES

Evaluating:
>
(A A)

=
Pm
xi
127
1
P 1
m P
m
x
P x2i
=
.
xi
m
x
x2i
Recalling the expression for the inverse of a 2 2 matrix,

P 2
1
xi
>
1
(A A) = P 2
x
m xi + m2 x
2 m
Furthermore, we have
>
A y=
y
Pm
x i yi
m
x
m

.

.
Therefore the estimate for b is

b =
P
X
m
xy + xi yi
1
2
P
(m
x
+
m
x
y
)
=
.
i i
x2i + m2 x
2
m x2i + m2 x
2
(A.6)
From (A.3) the uncertainty in b is given by 2 times the (2,2) element of (A> A)1 ,
b2 = 2
m
.
+ m2 x
2
x2i
(A.7)
Equations (A.6) and (A.7) correspond to those for ordinary least squares.
A.2
Recursive least squares
Suppose that the measurement data in (A.1) are presented sequentially and we wish to
determine the best solution for the parameters a as the new data become available. We can
write Eq. (A.2) in the form
(A.8)
y ` = A` a +
indicating that ` measurements have been made up till now (we assume ` > n), where as
before n is the number of parameters (the length of a). The least squares solution is, with
(A.4),
1 >
= (A>
a
A` y` =: a(`)
` A` )
and, from (A.5), the covariance matrix of a(`) is
1
` = (A>
.
` A` )
(A.9)
We have assumed for convenience that 2 = 1. Therefore we can write

a(`) = ` A>
` y` .
(A.10)
Suppose a new observation becomes available. (Well call it (x(` + 1), y(` + 1)) rather
than (x`+1 , y`+1 ), as this simplifies the notation considerably.) Now we must solve the least
squares problem

A`
y`
=
a + ,
y(` + 1)
A(` + 1)
where A(` + 1) = x(` + 1)> . According to (A.10) the solution is
>

A`
y`
a(` + 1) = `+1
.
A(` + 1)
y(` + 1)
(A.11)
128
From (A.9) we can obtain a recursive formula for the covariance matrix `+1 :

>

A`
A`
1
>
`+1 =
= A>
` A` + A`+1 A`+1
A(` + 1)
A(` + 1)
or
1
>
1
`+1 = ` + A(` + 1) A(` + 1).
(A.12)
Next we multiply Eq. (A.11) out,

>
a(` + 1) = `+1 (A>
` y` + A(` + 1) y(` + 1)),
and replace y` with A` a(`) to obtain

>
a(` + 1) = `+1 (A>
` A` a(`) + A(` + 1) y(` + 1)).
Using (A.9) and (A.12),

>
a(` + 1) = `+1 (1
` a(`) + A(` + 1) y(` + 1))
1

= `+1 `+1 a(`) A(` + 1)> A(` + 1)a(`) + A(` + 1)> y(` + 1) .
This simplifies to

a(` + 1) = a(`) + `+1 A(` + 1)> y(` + 1) A(` + 1)a(`) .
Finally, with the definition of the Kalman gain
K`+1 := `+1 A(` + 1)> ,
(A.13)
we also obtain a recursive equation for the parameter vector a, namely

a(` + 1) = a(`) + K`+1 y(` + 1) A(` + 1)a(`) .
(A.14)
Equations (A.12A.14) define a so-called Kalman filter for the least squares problem
(A.8). For input x(` + 1) = A(` + 1) the system response A(` + 1)a(`) is calculated in
(A.14) and compared with the measurement y(` + 1). Then the innovation, that is to say
the difference between the measurement and system response, is multiplied by the Kalman
gain determined by (A.13) and (A.12) and the old value a(`) is corrected accordingly.
Relation (A.12) is inconvenient as it calculates the inverse of the covariance matrix `+1
whereas we require the non-inverted form in order to determine the Kalman gain (A.13).
Fortunately (A.12) and (A.13) can be reformed as follows:

`+1 = I K`+1 A(` + 1) `

1
K`+1 = ` A(` + 1)> A(` + 1)` A(` + 1)> + 1 .
(A.15)
A.3. ORTHOGONAL REGRESSION
129
To see this, first of all note that the second equation in (A.15) is a consequence of the
first equation and (A.13). Therefore it suffices to show that the first equation is indeed the
inverse of (A.12):

1
`+1 1
`+1 = I K`+1 A(` + 1) ` `+1

= I K`+1 A(` + 1) + I K`+1 A(` + 1) ` A(` + 1)> A(` + 1)
= I K`+1 A(` + 1) + ` A(` + 1)> A(` + 1) K`+1 A(` + 1)` A(` + 1)> A(` + 1).
The second equality above follows from (A.12). But from the second equation in (A.15) we
have
K`+1 A(` + 1)` A(` + 1)> = ` A(` + 1)> K`+1
and therefore
>
>
`+1 1
`+1 = I K`+1 A(` + 1) + ` A(` + 1) A(` + 1) (` A(` + 1) K`+1 )A(` + 1) = I
as required.
A.3
Orthogonal regression
In the model for ordinary least squares regression the xs are assumed to be error-free. In
the calibration case where it is arbitrary what we call the reference variable and what we
call the uncalibrated variable to be normalized, we should allow for error in both x and y.
If we impose the model1
yi i = a + b(xi i ), i = 1 . . . m
(A.16)
with and as uncorrelated, white, Gaussian noise terms with mean zero and equal variances
2 , we get for the estimator of b, [KS79],
q
(s2yy s2xx ) + (s2yy s2xx )2 + 4s2xy
b =
(A.17)
2sxy
with
1 X
(yi y)2
m i=1
n
s2yy =
(A.18)
and the remaining quantities defined in the section immediately above. The estimator for a
is
a
= y b
x.
(A.19)
According to [Pat77, Bil89] we get for the dispersion matrix of the vector (
a, b)>

2b(1 + b2 ) x
x(1 + )
2 (1 + ) + sxy /b
x(1 + )
1 +
msxy
with
=
1 The
2b
(1 + b2 )sxy
(A.20)
(A.21)
model in equation (A.16) is often referred to as a linear functional relationship in the literature.
130
and where 2 can be replaced by
2 =
m
(n 2)(1 + b2 )
(s2yy 2bsxy + b2 s2xx ),
(A.22)
see [KS79].
It can be shown that estimators of a and b can be calculated by means of the elements
in the eigenvector corresponding to the smallest eigenvalue of the dispersion matrix of the
m by 2 data matrix with a vector of the xs in the first column and a vector of the ys in
the second column, [KS79]. This can be used to perform orthogonal regression in higher
dimensions, i.e., when we have, for example, more x variables than the one variable we have
here.
Appendix B
The Discrete Wavelet

Transformation
The following discussion follows [AS99] closely.
B.1
Inner product space
Let f and g be two functions of the real numbers IR and define their inner product as
Z
hf, gi =
f (t)g(t)dt.
The inner product space L2 (IR) is the collection of all functions f : IR IR such that
Z
kf k = hf, f i
1/2
1/2
f (t) dt
< .
The distance between two functions f (t) and g(t) in L2 (IR) is

Z
d(f, g) = kf gk =
B.2
1/2
(f (t) g(t)) dt
2
Haar wavelets
Let Vn be the collection of all piecewise constant functions of finite extent1 that have possible
discontinuities at the rational points m 2n , where m and n are integers, m, n Z. Then
all members of Vn belong to the inner product space L2 (IR),
Vn L2 (IR).
Define the the Haar scaling function according to
n
1 if 0 t 1
.
(t) =
0 otherwise
131
(B.1)
132
APPENDIX B. THE DISCRETE WAVELET TRANSFORMATION
Figure B.1: The Haar scaling function.
It is shown in Figure B.1.

Any function in Vn in [0, 1] can be expressed as a linear combination of the standard
Haar basis functions of the form
Cn = {n,k (t) = (2n t k) | k = 0, 1 . . . 2n 1}.
These basis functions have compact support and are orthogonal in the following sense:
hn,k , n,k0 i =
1
k,k0 .
2n
Note that 0,0 (t) = (t).

Consider the function spaces V0 and V1 with orthogonal bases {0,0 } and {1,0 , 1,1 },
respectively. According to the orthogonal decomposition theorem [?], any function in V1 can
be projected onto basis functions 0,0 (t) for V0 plus a residual in the space V0 which is
orthogonal to V0 . Formally,
V1 = V0 V0 .
For example
1,0 (t) =
h1,0 , 0,0 i
0,0 (t) + r(t).
h0,0 , 0,0 i
The residual function r(t) is in the residual space V0 . We see that

1
1
r(t) = 1,0 (t) 0,0 (t) = (2t) (t)
2
2
1
1
= (2t) [(2t) + (2t 1)] = (t)
2
2
where (t) is the Haar wavelet derived from the scaling function according to
(t) = (2t) (2t 1).
1 Such
functions are said to have compact support.
(B.2)
B.2. HAAR WAVELETS
133
Thus an alternative basis for V1 is

B1 = {(t), (t)} = {0,0 (t), 0,0 (t)}.
The wavelet 0,0 (t) is shown in Figure B.2
Figure B.2: The Haar wavelet.

We can repeat this argument for V2 = V1 V1 to obtain the basis
B2 = {0,0 (t), 0,0 (t), 1,0 (t), 1,1 (t)},
where now {1,0 (t), 1,1 (t)} is an orthogonal basis for Vi . In general, the Haar wavelet
basis for Vn is
Bn = {0,0 (t), 0,0 (t), 1,0 (t), 1,1 (t) . . . n1,0 (t), n1,1 (t) . . . n1,2n 1 (t)},
where
{m,k (t) = (2m t k) | k = 0 . . . 2n 1}
is an orthogonal basis for Vm , and
Vn = Vn1 Vn1
= V0 V0 . . . Vn2
Vn1
.
In the case of the Haar wavelets, there

is a simple correspondence between the basis
n
functions (, ) and the vector space IR2 , i.e. the space of 2n -dimensional vectors. Consider
for instance n = 2. Then the correspondence is

1
1
0
0
0
1
0
1
0
0
0,0 = , 2,0 = , 2,1 = , 2,2 = , 2,3 = ,
1
0
0
1
0
1
0
0
0
1
134
and
0,0
1
1
0
1
1
0
=
, 1,0 =
, 1,1 =
.
1
0
1
1
0
1
Thus the orthogonal basis B2 can be represented by the mutually orthogonal vectors

1
1
0
1
1 1 1 0
B2 = ,
.
,
,
1
0
1
1
1
0
1
1
Example: signal compression
We consider the continuous function f (t) = sin(20t)(log t)2 sampled at 64 evenly spaced
points on the interval [0, 1]. The 64 samples comprise a signal vector
f = (f0 , f1 . . . f63 )> = (f (0/63), f (1/63) . . . f (63/63))>
and can also be thought of as a piecewise constant function f(t) belonging to the function
space V6 . The function is shown in Figure B.3.
Figure B.3: The function sin(20t)(log x)2 sampled at 64 points on [0, 1].
We can express the function f(t) in the basis C6 as follows:
f = f0 6,0 (t) + f1 6,1 (t) + . . . f63 6,63 ,

where we think of the basis functions as vectors and where fi is the value of the function
sampled at point i, i = 0 . . . 63. Alternatively the signal can be written in the vector basis
B6 ,
f = w0 0,0 (t) + w1 0,0 (t) + . . . + w63 5,31 (t).

We can write this equivalently as the matrix equation
f = B6 w.
(B.3)
B.2. HAAR WAVELETS

where B6 is a (64 64)-dimensional
here, but
1 1
1 1
1 1
1 1
B3 =
1 1
1 1
1 1
1 1
135
matrix of ones and zeroes. This is too large to show
1
1
1
1
0
0
0
0
1
0
0
0 1 0
0
0
1
0
0 1
1
0
0
1
0
0
1 0
0
1 0
0
0
0
0
0
0
0
0
0
,
1
0
1 0
0
1
0 1
for example. The elements of the vector w comprise the wavelet coefficients. They are given
by the wavelet transform
w = B1
6 f.
The wavelet coefficients are thus an alternative way of representing the original signal f(t).
They are plotted in Figure B.4
Figure B.4: The wavelet coefficients w for the signal in Figure B.3.
Notice that many of the coefficients are close to zero. We can define a threshold below
which all coefficients are set exactly to zero. This generally leads to long series of zeroes in
w, so that it can be compressed efficiently,
w wcompr .
Figure B.5 shows the result of reconstructing the signal according to
f = B6 wcompr
after setting a threshold of 0.1. In all, 33 of the 64 wavelet coefficients are zero after
thresholding.
136
Figure B.5: The reconstructed signal after thresholding at 0.1.

The following program illustrates the above steps in IDL:
; generate a signal vector
t=findgen(64)/63
f=sin(20*t)*alog(t)*alog(t)
f[0]=0
thisDevice =!D.Name
set_plot, PS
Device, Filename=c:\temp\signal.eps,xsize=15,ysize=10,/Encapsulated
plot,t,f,color=1,background=FFFFFFXL
device,/close_file
; read in basis B6
filename = Dialog_Pickfile(Filter=*.dat,/Read)
openR,lun,filename,/get_lun
B6 = fltarr(64,64)
ReadF, lun, B6
; do the wavelet transform and display
w=invert(B6)##f
Device, Filename=c:\temp\wavelet_coeff.eps,xsize=15,ysize=10,/Encapsulated
plot,t,w,color=1,background=FFFFFFXL
; display "compressed" signal
w( where(abs(w) lt 0.1) )=0.0
Device, Filename=c:\temp\recon_signal.eps,xsize=15,ysize=10,/Encapsulated
plot,t,w,color=1,background=FFFFFFXL
device,/close_file
end
B.3. MULTI-RESOLUTION ANALYSIS
B.3
137
Multi-resolution analysis
So far we have considered only functions on the interval [0, 1] with basis functions n,k (t) =
(2n t k), k = 1 . . . 2n 1. We can extend this to functions defined on all real numbers IR
in a straightforward way. For example
{(t k) | k Z}
is a basis for the space V0 of all piecewise constant functions with compact support (finite
extent) having possible breaks at integer values. More generally, a basis for the set Vn of
piecewise constant functions with possible breaks at m 2n and compact support is
{(2n t k) | k Z}.
We can even allow n < 0. For example n = 1 means that the possible breaks are at even
integer values.
We can think of the collection of nested subspaces of piecewise constant functions
. . . V1 V0 V1 V2 . . . L2 (IR),
as being generated by the Haar scaling function . This collection is called a multiresolution
analysis (MRA). A general MRA must have the following properties:
S
1. V = nZ Vn is dense in L2 (IR), that is, for any function f L2 (IR) there exists a
series of functions, one in each Vn , which converges to f . This is true of the Haar
MRA, see Figure 2.7 for example.
T
2. The separation property: I = nZ Vn = {0}. For the Haar MRA, this means that
any function in I must be piecewise constant on all intervals. The only function in
L2 (IR) with this property and compact support is f (t) = 0, so the separation property
is satisfied.
3. The function f (t) Vn if and only if f (2n t) V0 . In the Haar MRA, if f (t) V1
then it is piecewise constant on intervals of length 1/2. Therefore the function f (21 t)
is piecewise constant on intervals of length 1, that is f (21 t) V0 , etc.
4. The scaling function is an orthonormal basis for the function space V0 , i.e. h(t
k), (t k 0 )i = kk0 . This is of course the case for the Haar scaling function.
In the following, we will think of (t) as any scaling function which generates an MRA
in the above sense. Since {(t k) | k Z} is an orthonormal basis for V0 , it follows that
{(2t k) | k Z} is an orthogonal basis for V1 . That is, let f (t) V1 . Then by property
3, f (t/2) V0 and
X
X
f (t/2) =
ak (t k) f (t) =
ak (2t k).
k
In particular, since (t) V0 V1 , we have the dilation equation

(t) =
X
k
ck (2t k).
(B.4)
138
The constants ck are called the refinement coefficients. For example, the dilation equation
for the Haar wavelets is
(t) = (2t) + (2t 1)
so that the refinement coefficients are c0 = c1 = 1, ck = 0 otherwise.
Note that c20 + c21 = 2. It is easy to show that this is a general property of the refinement
coefficients:
X
X
1X 2
ck (2t k),
ck0 (2t k 0 )i =
ck .
1 = h(t), (t)i = h
2
0
k
Therefore,
c2k = 2,
(B.5)
k=
which is also called Parsevals formula. In a similar way it is easy to show that
ck ck2j = 0, j 6= 0.
(B.6)
k=
B.4
Fixpoint wavelet approximation
There are many other possible scaling functions that define or generate a MRA. Some of
these cannot be expressed as simple, analytical functions. But once we have the refinement
coefficients for a scaling function, we can approximate that scaling function to any desired
degree of accuracy using the dilation equation. (In fact we can work with a MRA even
when there is no simple analytical representation for the scaling function which generates
it.) The idea is to iterate the refinement equation with a so-called fixpoint algorithm until
it converges to a sequence of points which approximates (t).
Let F be the function that assigns the expression
X
cn (2t n)
F ()(t) =
n
to any function (t), where cn are refinement coefficients. Applying F to the Haar scaling
function:
X
F ()(t) =
cn (2t n) = (t)
n
where the second equality follows from the dilation equation. Thus is a fixpoint of F .
The following recursive scheme can be used to estimate a scaling function with up to
four refinement coefficients:
f0 (t) = t,0
fi (t) = c0 fi1 (2t) + c1 fi1 (2t 1) + c2 fi1 (2t 2) + c3 fi1 (2t 3).
In this scheme, t takes on values of the form m 2n , m, n Z, only. The first definition is
the termination condition for the recursion and approximates the scaling function to zeroth
order as the Dirac delta function. The second relation defines the ith approximation to the
scaling function in terms of the (i 1)th approximation using the dilation equation. We can
calculate the set

j
n
j
=
0
.
.
.
3(2
)
, n 1,
fn

2n
B.4. FIXPOINT WAVELET APPROXIMATION
139
as a pointwise approximation of . Here is a recursive IDL program to approximate any

scaling function with 4 refinement coefficients:
ffunction f, t, i
common refinement, c0,c1,c2,c3
if (i eq 0) then if (t eq 0) then return, 1.0 else return, 0.0 $
else return, c0*f(2*t,i-1)+c1*f(2*t-1,i-1)+c2*f(2*t-2,i-1)+c3*f(2*t-3,i-1)
end
common refinement, c0,c1,c2,c3
; refinement coefficients for Haar scaling function
;
c0=1 & c1=1 & c2=0 & c3=0
; refinement coefficients for Daubechies scaling function
c0=(1+sqrt(3))/4 & c1=(3+sqrt(3))/4 & c2=(3-sqrt(3))/4 & c3=(1-sqrt(3))/4
; fourth order approximation
n=4
t = findgen(3*2^n)
ff=fltarr(3*2^n)
for i=0,3*2^n-1 do ff[i]=f(t[i]/2^n,n)
thisDevice =!D.Name
set_plot, PS
Device, Filename=c:\temp\daubechies_approx.eps,xsize=15,ysize=10,/Encapsulated
plot,t/2^n,ff,yrange=[-1,2],color=1,background=FFFFFFXL
device,/close_file
set_plot,thisDevice
end
Figure B.6: The fixpoint approximation of the Haar scaling function to order n = 4.
Figure B.6 shows the result of n=4 iterations using the refinement coefficients c0 = c1 =
1, c2 = c3 = 0 for the Haar scaling function.
140
B.5
The mother wavelet
Let f be a signal or function, f L2 (IR), and let Pn (f ) denote its projection onto the space
Vn . We saw in the case of the Haar MRA that we can always write
X hf, n,k i
n,k .
Pn+1 (f ) = Pn (f ) +
hn,k , n,k i
k
The Haar wavelet

n,k (t) = (2n t k), k, n Z,
was seen to be an orthogonal basis for Vn and
(t) = (2t) (2t 1),
(B.7)
where is the scaling function. It can in fact be shown that this is always the case for any
MRA, except that the last expression relating the mother wavelet to the scaling function
is generalized.
Consider now some MRA with a normalized scaling function defined (in the sense of
the preceding section) by the dilation equation (B.4). Since
1
1
h(t), (t)i = ,
2
2
h(2t k), (2t k)i =

the functions
2(2t k) are normalized and orthogonal. We write (B.4) in the form

X
(t) =
hk 2(2t k),
(B.8)
k
where
ck
hk = .
2
It follows from (B.8) that
h2k = 1.
(B.9)
Now we assume, in analogy to (B.8), that can be expressed in terms of the scaling function
as
X
gk 2(2t k).
(B.10)
(t) =
k
Since V0 and V0 we have

h, i =
hk gk = 0.
(B.11)
Similarly,
h(t k), (t m)i =
gi gi2(km) = k,m .
(B.12)
A set of coefficients that satisfies (B.11) and (B.12) is given by

gk = (1)k h1k ,
so we obtain, finally, the relationship between the wavelets and the scaling function:
X
X
(t) =
(1)k h1k 2(2t k) =
(1)k c1k (2t k).
(B.13)
k
B.6. THE DAUBECHIES WAVELET
B.6
141
The Daubechies wavelet
The Daubechies scaling function is derived according to the following two requirements on
an MRA:
1. Compact support: The scaling function (t) is required to be zero outside the interval
0 < t < 3. This means that the refinement coefficients ck vanish for k < 0, k > 3. To see
this, note that
Z
3
c3 = 2h(t), (2t + 3)i =
(t)(2t + 3)dt = 0
0
and similarly for k = 4, 5 . . . and for k = 6, 7 . . .. Therefore, from the dilation equation,
(1/2) = 0 = c2 (1 + 2) + c1 (1 + 1) + . . . c2 = 0
and similarly for k = 1, 4, 5.
Thus from (B.5), we can conclude that
c20 + c21 + c22 + c23 = 2
(B.14)
c0 c2 + c1 c3 = 0.
(B.15)
and from (B.6), with j = 1, that

In addition, again from the dilation equation, we can write
Z
Z
Z
3
3
X
X
ck
(t)dt =
ck
(2t k)dt =
(u)du.
2
k=0
k=0
R
But one can show that an MRA implies (t)dt 6= 0 so we have
c0 + c1 + c2 + c3 = 2.
(B.16)
2. Regularity: All constant and linear polynomials can be written as a linear combination of
the basis {(t k) | k Z} for V0 . This implies that there is no residual in the orthogonal
decomposition of f (t) = 1 and f (t) = t onto the basis, that is,
Z
Z
(t)dt =
t(t)dt = 0.
(B.17)
With (B.13) the mother wavelet is

(t) = c0 (2t 1) + c1 (2t) c2 (2t + 1) + c3 (2t + 2).
(B.18)
The first requirement in (B.17) gives immediately

c0 + c1 c2 + c3 = 0.
From the second requirement we have
Z
Z
3
X
t(t)dt = 0 =
(1)k+1 ck
k=0
3
X
t(2t 1 + k)dt
u+1k
(u)du
4
k=0
Z
Z
0
c0 + c2 2c3
=
u(u)du +
(u)du,
4
4
(1)k+1 ck
(B.19)
142
using (B.19). Thus

c0 + c2 2c3 = 0.
(B.20)
Equations (B.14), (B.15), (B.16), (B.19) and (B.20) comprise a system of five equations in
four unknowns. A solution is given by
1+ 3
3+ 3
3 3
1 3
c0 =
, c1 =
, c2 =
, c3 =
,
4
4
4
4
which are known as the D4 refinement coefficients. Figure B.7 shows the corresponding
scaling function, determined with the fixpoint method described earlier.
Figure B.7: The fixpoint approximation of the Daubechies scaling function to order n = 4.
Example: image compression

The following program, adapted from the IDL Reference Guide, uses the Daubechies D4
wavelet to compress a gray-scale satellite image. It displays the original and compressed
images and determines the file size of the compressed image. The next section discusses the
implementation of the wavelet transformation in terms of a filter bank.
; Image compression with Daubechies wavelet
; read a bitmap image and cut out a 512x512 pixel array
cd, c:\idl\projects\image analysis
filename = Dialog_Pickfile(Filter=*.bmp,/Read)
image1 = Read_BMP(filename)
image = bytarr(512,512)
image[*,*] = image1[1,0:511,0:511]
; display cutout and size
window,0,xsize=512,ysize=512
B.7. WAVELETS AND FILTER BANKS
143
wset, 0
tv, bytscl(image)
print, Size of original image is, 512*512L, bytes
; perform wavelet transform with D4 wavlet
wtn_image = wtn(image, 4)
; convert to sparse array with threshold 20 and write to disk
sparse_image = sprsin(wtn_image,thresh=20)
write_spr, sparse_image, sparse.dat
openr, 1, sparse.dat
status = fstat(1)
close, 1
print, Size of compressed image is, status.size, bytes
; reconstruct full array, do inverse wavelet transform and display
wset,1
tv, bytscl(wtn(fulstr(sparse_image), 4, /inverse))
end
B.7
Wavelets and filter banks
In the case of the Haar wavelets we were able to carry out the wavelet transformation with
vectors and matrices. In general, we cant represent scaling functions in this way. In fact
usually all that we have to work with are the refinement coefficients. So how can we perform
the wavelet transformation? To answer this question, consider a row of pixels
(s(0), s(1) . . . s(m 1))
in a satellite image, where m = 2n , and the associated vector signal on [0, 1] given by
s = (s0 , s1 . . . sm1 )> = (s(0/(m 1)), s(1/(m 1)) . . . s(1))> .
In the MRA generated by a scaling function , such as D4 , this signal defines a function
fn (t) Vn on the interval [0, 1] according to
fn (t) =
m1
X
j=0
sj n,j =
m1
X
sj (2n t j).
(B.21)
j=0
Assume that the basis functions are appropriately normalized. The projection of fn (t) onto
Vn1 is then
X
m/21
fn1 (t) =
k=0
hfn , (2n1 t k)i(2n1 t k) =
m/21
(Hs)k (2n1 t k),
k=0
where

>
Hs = hfn , (2n1 t)i, hfn , (2n1 t 1)i . . . hfn , (2n1 t m/2 1)i
144
is the signal vector in Vn1 . The operator H is interpreted as a low-pass filter. It averages
the original signal s and reduces its length by a factor of two. We have, using (B.21),
(Hs)k =
m1
X
sj h(2n t j), (2n1 t k)i.
j=0
From the dilation equation with normalized basis functions,

X
(2n1 t k) =
hk0 (2n t 2k k 0 ),
k0
so we can write
(Hs)k =
m1
X
sj
j=1
Therefore
(Hs)k =
m1
X
hk0 h(2n t j), (2n t 2k k 0 )i
k0
j=1
m1
X
sj
hk0 j,k0 +2k .
k0
hj2k sj ,
k = 0...
j=0
m
1 = 2n1 1.
2
(B.22)
For the Daubechies scaling function,
1+ 3
3+ 3
3 3
1 3
, h1 =
, h2 =
, h3 =
, h4 = 0, . . . .
h0 =
4 2
4 2
4 2
4 2
Thus the elements of the filtered signal are
(Hs)0 = h0 s0 + h1 s1 + h2 s2 + h3 s3
(Hs)1 = h0 s2 + h1 s3 + h2 s4 + h3 s5
(Hs)3 = h0 s4 + h1 s5 + h2 s6 + h3 s7
..
.
This is just the convolution of the filter H = (h3 , h2 , h1 , h0 ) with the signal s,
Hs = H s,
see Eq. (2.12), except that only every second term is retained. This is referred to as
downsampling and is illustrated in Figure B.8.
In the same way, the high-pass filter G projects fn (t) onto the orthogonal subspace Vn1
according to
m1
X
m
gj2k sj , k = 0 . . .
1 = 2n1 1.
(B.23)
(Gs)k =
2
j=0
Recall that
gk = (1)k h1k
145
2
Hs
Figure B.8: Schematic representation of the filter H. The symbol 2 indicates downsampling
by a factor of two.
so that the nonzero high-pass filter coefficients are actually

g2 = h3 , g1 = h2 , g0 = h1 , g1 = h0 .
The concatenated signal
(Hs, Gs) = (s1 , d1 )
is the projection of fn onto Vn1 Vn1

. It has the same length as the original signal s and
is an alternative representation of that signal. Its generation is illustrated in Figure B.9 as
a filter bank.
s1
d1
Figure B.9: Schematic representation of the filter bank H, G.
The projections can be repeated on s1 = Hs to obtain the projection

(Hs1 , Gs1 , Gs) = (s2 , d2 , d1 )
onto Vn2 Vn2

Vn1
and so on.
The original signal can be reconstructed at any stage by applying the inverse operators
H and G . For the first stage these are defined by
m/21
(H s1 )k =
hk2j s1j ,
k = 0 . . . m 1 = 2n 1,
(B.24)
gk2j d1j ,
k = 0 . . . m 1 = 2n 1,
(B.25)
j=0
m/21
1
(G d )k =
j=0
with analagous definitions for the other stages. To understand whats happening, consider
146
the elements of the filtered signal (B.24). These are

(H s1 )0 = h0 s10
(H s1 )1 = h1 s10
(H s1 )2 = h2 s10 + h0 s11
(H s1 )3 = h3 s10 + h1 s11
(H s1 )4 = h2 s11 + h0 s12
(H s1 )5 = h3 s11 + h1 s12
..
.
This is just the convolution of the filter H = (h0 , h1 , h2 , h3 ) with the signal
s10 , 0, s11 , 0, s12 , 0 . . . s1m/21 , 0.
This is called the upsampled signal. The filter (B.24) can be represented schematically as in
Figure B.10.
s1
H s1
Figure B.10: Schematic representation of the filter H . The symbol 2 indicates upsampling
by a factor of two.
Equation (B.25) is interpreted in a similar way. Finally we add the two results to get
the original signal:
H s1 + G d1 = s.
To see this, write the equation out for a particular value of k:
X
m1
X
m/21
(H s1 )k + (G d1 )k =
hk2j
hj 0 2j sj 0 + gk2j
j 0 =0
j=0
m1
X
gj 0 2j sj 0
j 0 =0
Combining terms and interchanging the summations, we get

(H s1 )k + (G d1 )k =
m1
X
m/21
sj 0
j 0 =0
[hk2j hj 0 2j + gk2j gj 0 2j ].
j=0
Now, using gk = (1)k h1k ,

(H s1 )k + (G d1 )k =
m1
X
j 0 =0
m/21
sj 0
j=0
[hk2j hj 0 2j + (1)k+j h1k+2j h1j 0 +2j ].
147
With the help of (B.5) and (B.6) it is easy to show that the second summation above is just
j 0 k . For example, suppose k is even. Then
X
m/21
[hk2j hj 0 2j + (1)k+j h1k+2j h1j 0 +2j ] =
j=0
0
h0 hj 0 k + h2 hj 0 k+2 + (1)j [h1 h1j 0 +k + h3 h3j 0 +k ].

If j 0 = k, the right hand side reduces to
h20 + h21 + h22 + h23 = 1,
from (B.5) and hk = ck / 2. For any other value of j 0 , the expression is zero. Therefore we
can write
m1
X
(H s1 )k + (G d1 )k =
sj 0 j 0 k = sk ,
j 0 =0
as claimed. The reconstruction of the original signal from s1 and d1 is shown in Figure B.11
as a synthesis bank.
s1
d1
Figure B.11: Schematic representation of the synthesis bank H , G .

The extension of the procedure to two-dimensional signals (e.g. satellite imagery) is
straightforward, see [Mal89]. Figure B.12 shows a single application of the filters H and G
to the rows and columns of a satellite image. The image is a signal which defines a twodimensional function f in V10 V10 . The Daubechies D4 refinement coefficients are used to
generate the filters. The result of the low-pass filter is in the upper left hand quadrant. This
is the projection of f onto V9 V9 . The other three quadrants represent the projections
onto the orthogonal subspaces V9 V9 , V9 V9 and V9 V9 . The following IDL program
illustrates the procedure.
; the Daubechies kernels
H = [1-sqrt(3),3-sqrt(3),3+sqrt(3),1+sqrt(3)]/(4*sqrt(2))
G = [-H[0],H[1],-H[2],H[3]]
; arrays for wavelet coefficients
f0 = fltarr(512,512)
f1 = fltarr(512,256)
g1 = fltarr(512,256)
ff1 = fltarr(256,256)
148
Figure B.12: Wavelet projection of a satellite image with (2.36) and(2.37).
fg1 = fltarr(256,256)
gf1 = fltarr(256,256)
gg1 = fltarr(256,256)
; read a bitmap image and cut out a 512x512 pixel array
filename = Dialog_Pickfile(Filter=*.bmp,/Read)
image = Read_BMP(filename)
; 24 bit image, so get first layer
f0[*,*] = image[1,0:511,0:511]
; display cutout
wset, 0
tv, bytscl(f0)
; filter columns and downsample
ds = findgen(256)*2
for i=0,511 do begin
temp = convol(transpose(f0[i,*]),H,center=0,/edge_wrap)
f1[i,*] = temp[ds]
temp = convol(transpose(f0[i,*]),G,center=0,/edge_wrap)

g1[i,*] = temp[ds]
endfor
; filter rows and downsample
for i=0,255 do begin
temp = convol(f1[*,i],H,center=0,/edge_wrap)
ff1[*,i] = temp[ds]
temp = convol(f1[*,i],G,center=0,/edge_wrap)
fg1[*,i] = temp[ds]
temp = convol(g1[*,i],H,center=0,/edge_wrap)
gf1[*,i] = temp[ds]
temp = convol(g1[*,i],G,center=0,/edge_wrap)
gg1[*,i] = temp[ds]
endfor
f0[0:255,256:511]=bytscl(ff1[*,*])
f0[0:255,0:255]=bytscl(gf1[*,*])
f0[256:511,0:255]=bytscl(gg1[*,*])
f0[256:511,256:511]=bytscl(fg1[*,*])
thisDevice =!D.Name
set_plot, PS
Device, Filename=c:\temp\pyramid.eps,xsize=10,ysize=10,/Encapsulated
tv,f0
device,/close_file
set_plot,thisDevice
end
149
150
Appendix C
Advanced Neural Network

Training Algorithms
The standard backpropagation algorithm introduced in Chapter 10 is notoriously slow to
converge. In this Appendix we will develop two additional training algorithms for the
two-layer, feed-froward neural network of Figure 10.4. The first of these, scaled conjugate
gradient makes use of the second derivatives of the cost function with respect to the synaptic
weights, i.e. the Hessian matrix. The second, the Kalman filter method, takes advantage of
the statistical properties of the weight parameters. Both techniques are considerably more
efficient than backpropagation.
C.1
The Hessian matrix
The Hessian matrix was introduced in Chapter 10 as

Hij =
2 E(w)
.
wi wj
(C.1)
It is the (symmetric) matrix of second order partial derivatives of the cost function E(w)
with respect to the synaptic weights, the latter thought of as a single column vector
h
w1
..
.
h
w
Lo
w=
w1
.
..
o
wM
of length nw = L(N + 1) + M (L + 1). Since H is symmetric, its eigenvectors ui , i = 1 . . . nw

are orthogonal and any vector v in the space of the synaptic weights can be expressed as a
linear combination of them, e.g.
nw
X
i ui .
v=
i=1
151
152
APPENDIX C. ADVANCED NEURAL NETWORK TRAINING ALGORITHMS
But then we have
v> Hv = v>
i i ui =
i2 i ,
and we conclude that H is positive definite if and only of all of its eigenvalues i are positive.
Thus a good way to check if one is at or near a local minimum in the cost function is to
examine the eigenvalues of the Hessian.
The scaled conjugate gradient algorithm makes explicit use of the Hessian matrix for
more efficient convergence to a minimum in the cost function. The disadvantage of using H
is that it is difficult to compute efficiently. For example, for a typical classification problem
with N = 3-dimensional input data, L = 8 hidden neurons and M = 12 land use categories,
there are
[L(N + 1) + M (L + 1)]2 = 19, 600
matrix elements to determine at each iteration. We develop in the following an efficient
method to calculate not H directly, but rather the product v> H for any vector v having
nw components. Our approach follows Bishop [Bis95] closely.
C.1.1
The R-operator
Let us begin by summarizing some results of Chapter 10 for the two-layer, feed forward
network:
x0> = (x1 . . . xN ) input observation
y> = (0 . . . 1 . . . 0)
>
0>
x = (1, x )
h
I =W
0
h>
biased input observation
activation vector for the hidden layer
n = g (I )
output signal vector from the hidden layer
n> = (1, n0> )

o
I =W
o>
class label
(C.2)
ditto with bias
activation vector for the output layer
output vector from the network.
m = g (I )
The corresponding activation functions are, for the hidden neurons,

g h (Ijh ) =
1
h
j = 1 . . . L,
(C.3)
k = 1 . . . M.
(C.4)
1 + eIj
and for the output neurons,

o
(Iko )
eIk
= PM
k0 =1
eIk0
The first derivatives of the local cost function with respect to the output and hidden weights,
(10.19) and (10.22), can be written concisely as
E
= n o>
Wo
E
= x h> ,
Wh
(C.5)
o = y m
(C.6)
where
C.1. THE HESSIAN MATRIX

and
153
0
h

= n (1 n) Wo o .
(C.7)
With Bishop [Bis95] we introduce the R-operator according to

Rv {} := v>
,
w
Obviously we have
v> = (v1 . . . vnw ).
Rv {w} =
vj
w
= v.
wj
We adopt the convention that the result of applying the R-operator has the same structure
as the argument to which it is applied. Thus for example
Rv {Wh } = Vh ,
where Vh , like Wh , is an (N + 1) L matrix consisting of the first (N + 1) L components
of the vector v.
Next we derive an expression for v> H in terms of the R-operator.
(v> H)j =
nw
X
vi Hij =
i=1
(v H)j = v
w
>
E
wj
w
X
2E
=
vi
wi wj
w
i
i=1
vi
i=1
or
>
nw
X

= Rv
E
wj
E
wj

,
j = 1 . . . nw .
Since v> H is a row vector, this can be written

v > H = Rv
E
w>

Rv
E
Wh

, Rv
E
Wo

.
(C.8)
Note the reorganization of the structure in the argument of Rv , namely w> (Wh , Wo ).
This is merely for convenience. Once the expressions on the right have been evaluated, the
result must be flattened back to a row vector. Equation (C.1.1) is understood to involve
the local cost function. In order to complete the calculation we must sum over all training
pairs.
Applying the chain rule to (C.5),

E
= nRv { o> } RV {n} o>
Wo

E
Rv
= xRv { h> },
Wh
Rv
so that, in order to evaluate (C.1.1), we only need expressions for

Rv {n}, Rv { o } and Rv { h }.
(C.9)
154
Determination of Rv {n}
From (C.2) we can write

Rv {n} =
0
Rv {n0 }

(C.10)
and, from the chain rule,

Rv {n0 } = n0 (1 n0 ) Rv {Ih }
(C.11)
Rv {Ih } = Vh> x.
(C.12)
and
is interpreted as an L (N + 1)-dimensional
Note that, according to our convention, V
matrix, since the argument Ih is a vector of length L.
h>
Determination of Rv { o }
With (C.6) and (C.2) we get
Rv { o } = Rv {m} = v>
m
= g o 0 (Io ) Rv {Io },
w
where the prime denotes differentiation, or

Rv { o } = m (1 m) Rv {Io }.
(C.13)
Again with (C.2) we determine

Rv {Io } = Wo> Rv {n} + Vo> n,
(C.14)
where Rv {n} is determined by (C.1012).

Determination of Rv { h }
We begin by writing (C.7) in the form

0
0
0
Wo o .
=
h
g h (Ih )
Operating with Rv {},

0
0
0
0
o o
00
0
=
W
Vo o
+
h
Rv { }
g h (Ih )
Rv {Ih }
g h (Ih )

0
0
+
Wo Rv { o }.
g h (Ih )
Now we use the derivatives of the activation function
0
g h (Ih ) = n0 (1 n0 )
00
g h (Ih ) = n0 (1 n0 )(1 2n0 )

to obtain

h
i
0
0
= n (1 n) (1 2n)
Wo o + Vo o + Wo Rv { o } (C.15)
h
h
Rv {I }
Rv { }
in which all of the terms on the right have now been determined.
As already mentioned, equation () has been evaluated in terms of the local cost function.
The final step in the calculation involves summing over all of the training pairs.
C.1. THE HESSIAN MATRIX
C.1.2
155
Calculating the Hessian
To calculate the Hessian matrix for the neural network, we evaluate (C.1.1) successively for
the vectors
v1> = (1, 0, 0 . . . 0) . . . vn>w = (0, 0, 0 . . . 1)
and build up H row for row:
v1> H
H = ... .
vn>w H
The following excerpt from the IDL program FFNCG DEFINE (see Appendix D) implements a vectorized version of the preceding calculation of v> H and H:
Function FFNCG::Rop, V
nw = self.LL*(self.NN+1)+self.MM*(self.LL+1)
; reform V to dimensions of Wh and Wo and transpose
VhT = transpose(reform(V[0:self.LL*(self.NN+1)-1],self.LL,self.NN+1))
Vo = reform(V[self.LL*(self.NN+1):*],self.MM,self.LL+1)
VoT = transpose(Vo)
; transpose the weights
WhT = transpose(*self.Wh)
Wo = *self.Wo
WoT = transpose(Wo)
; vectorized forward pass
X = *self.Xs
Zeroes = fltarr(self.p)
Ones = Zeroes + 1.0
N = [[Ones],[1/(1+exp(-WhT##X))]]
Io = WoT##N
A = exp(Io)
sum = total(A,2)
M = fltarr(self.p,self.MM)
for k=0,self.MM-1 do M[*,k] = A[*,k]/sum
; evaluation of v^T.H
D_o = *self.Ys-M
; dô
RIh = VhT##X
; Rv{I^h}
RN = N*(1-N)*[[Zeroes],[RIh]]
; Rv{n}
RIo = WoT##RN + VoT##N
; Rv{Iô}
Rd_o = -M*(1-M)*RIo
; Rv{dô}
Rd_h = N*(1-N)*((1-2*N)*[[Zeroes],[RIh]]*(Wo##D_o) + Vo##D_o + Wo##Rd_o)
Rd_h = Rd_h[*,1:*]
; Rv{d^h}
REo = -N##transpose(Rd_o)-RN##transpose(D_o) ; Rv{dE/dWo}
REh = -X##transpose(Rd_h)
; Rv{dE/dWh}
return, [REh[*],REo[*]]
; v^T.H
End
156
Function FFNCG::Hessian
nw = self.LL*(self.NN+1)+self.MM*(self.LL+1)
v = diag_matrix(fltarr(nw)+1.0)
H = fltarr(nw,nw)
for i=0,nw-1 do H[*,i] = self -> Rop(v[*,i])
return, H
End
C.2
Scaled conjugate gradient training
The backpropagation algorithm of Chapter 10 attempts to minimize the cost function locally,
that is, weight updates are made immediately after presentation of a single training pair to
the network. We will now consider a global approach aimed at minimization of the full cost
function (10.15), which we denote in the following E(w). The symbol w is, as before, the
nw -component vector of synaptic weights.
Now let the gradient of the cost function at the point w be g(w), i.e.
g(w)
E(w),
wi
The Hessian matrix

(H)ij =
2 E(w)
wi wj
i = 1 . . . nw .
i, j = 1 . . . nw
can then be expressed conveniently as the outer product

H=
C.2.1
g(w)> .
w
(C.16)
Conjugate directions
The search for a minimum in the cost function can be visualized as a series of points in the
space of synaptic weight parameters,
w1 , w2 . . . wk1 , wk , wk+1 . . . ,
whereby the point wk is determined by minimizing E(w) along some search direction dk1
which originated at the preceding point wk1 . This is illustrated in Figure C.1 and corresponds to the vector equation
wk = wk1 + k1 dk1 .
(C.17)
Here dk1 is a unit vector along the chosen search direction and the scalar k1 minimizes
the cost function along that direction:

k1 = arg min E wk1 + dk1 .
(C.18)
If, starting from wk , we now wish to take the next minimizing step in the weight space,
it is not efficient simply to choose, as in backpropagation, the direction of the local gradient
g(wk ) at the new starting point wk . It follows namely from (C.18) that

E wk1 + dk1 =
=0
k1
C.2. SCALED CONJUGATE GRADIENT TRAINING

wk1
k1
dk1
157
*
dk ?
R
wk
g(wk )
?
Figure C.1: Search directions in weight space.
or
>
E wk1 + k1 dk1 dk1 = g(wk )> dk1 = 0.

w
(C.19)
The gradient g(wk ) at the new point wk is thus always orthogonal to the preceding search
direction dk1 . This is indicated in Figure C.1. Since the algorithms has just succeeded
in reducing the gradient of the cost function along dk1 to zero, we would prefer to choose
the search direction dk so that the component of the gradient along the old search direction
remains as small as possible. Otherwise we are undoing what we have just accomplished.
Therefore we choose dk according to the condition
g wk + dk
>
dk1 = 0.
But to first order in we have, with (C.16),

g wk + dk
>
= g(wk )> + dk
>
>
g(wk )> = g(wk )> + dk H
w
and the above condition is, with (C.19), equivalent to

>
dk Hdk1 = 0.
(C.20)
Directions which satisfy Equation (C.20) are referred to as conjugate directions.
C.2.2
Minimizing a quadratic function
Of course the neural network cost function is not quadratic in the synaptic weights. However
within a sufficiently small region of weight space it can be approximated as a quadratic
function. We describe in the following an efficient procedure to find the global minimum of
a quadratic function of w having the general form
1
E(w) = E0 + b> w + w> Hw,
2
(C.21)
158
where b and H are constant and the matrix H is positive definite.

The local gradient at the point w is given by
E(w) = b + Hw,
w
g(w) =
and at the global minimum w ,
b + Hw = 0.
(C.22)
Now let {dk | k = 1 . . . nw } be a set of conjugate directions satisfying (C.20),1

>
dk Hd` = 0
for k 6= `, k, ` = 1 . . . nw .
(C.23)
The search directions dk are linearly independent. In order to demonstrate this let us assume
the contrary, that is, that there exists an index k and constants k0 , k 0 6= k, not all of which
are zero, such that
nw
X
0
dk =
k0 dk .
k0 =1
k0 6=k
But from (C.23) we have at once

k0 dk
0>
for k 0 6= k
Hdk = 0
and, since H is positive definite,

k0 = 0
for k 0 6= k.
The assumption thus leads to a contradiction and the dk are indeed linearly independent.
The conjugate directions thus constitute a (non-orthogonal) vector basis for the entire weight
space.
In the search for the global minimum suppose we begin at an arbitrary point w1 and
express the vector w w1 spanning the distance to the global minimum as a linear combination of the basis vectors dk :
nw
X
w w1 =
k dk .
(C.24)
k=1
Further, define
wk = w1 +
k1
X
` d`
(C.25)
`=1
and split (C.24) up into nw steps

wk+1 = wk + k dk ,
k = 1 . . . nw .
(C.26)
At the kth step the search starts at the point wk and proceeds a distance k along the
conjugate direction dk . After nw such steps the global minimum w is reached, since from
(C.24C.26) it follows that
w = w1 +
nw
X
k dk = wnw + nw dnw = wnw +1 .
k=1
1 It
can be shown that such a set always exists, see e.g. [Bis95].
159
>
We get the necessary step sizes k from (C.24) by multiplying from the left with d` H,
>
>
d` Hw d` Hw1 =
nw
X
>
k d` Hdk .
k=1
From (C.22) and (C.23) we can write this as

>
>
d` (b + Hw1 ) = ` d` Hd` ,
and an explicit formula for the step sizes is given by
>
` =
d` (b + Hw1 )
>
d` Hd`
` = 1 . . . nw .
But with (C.24) and (C.25),

>
>
dk Hwk = dk Hw1 + 0,
and therefore, replacing index k by `,
>
>
d` Hw` = d` Hw1 .
The step lengths are thus
>
` =
d` (b + Hw` )
>
d` Hd`
` = 1 . . . nw .
Finally, using the notation g` = g(w` ) = b + Hw` and substituting ` k,

>
k =
dk g k
>
dk Hdk
k = 1 . . . nw .
(C.27)
For want of a better alternative we can choose the first search direction along the negative
local gradient
E(w1 ).
d1 = g1 =
w
(Note that d1 is not a unit vector.) We move according to (C.27) a distance
>
1 =
d1 d1
d1 > Hd1
along this direction to the point w2 , at which the local gradient g2 is orthogonal to d1 . We
then choose the new conjugate search direction d2 as a linear combination of the two:
d2 = g2 + 1 d1
or, at the kth step,
dk+1 = gk+1 + k dk .
(C.28)
160

>
We get the coefficient k from (C.28) and (C.20) by multiplication on the left with dk H:
>
>
0 = dk Hgk+1 + k dk Hdk ,
from which follows
>
k =
gk+1 Hdk
>
dk Hdk
(C.29)
Equations (C.26C.29) constitute a recipe with which, starting at an arbitrary point w1

in weight space, the global minimum of the quadratic function (C.21) is found in precisely
nw steps.
C.2.3
The algorithm
Returning now to the non-quadratic neural net cost function E(w) we will apply the above
method to minimize it. We must take two things into consideration.
First of all, the Hessian matrix H is neither constant nor everywhere positive definite.
We will denote its local value at the point wl as Hk . When Hk is not positive definite it
can happen that (C.27) leads to a step along the wrong direction the numerator might
turn out to be negative. Therefore we replace (C.27) with2
>
k =
dk g k
>
dk Hdk + k |dk |2
k = 1 . . . nw .
(C.30)
The constant k is supposed to ensure that the denominator in (C.30) is always positive. It
is initialized for k = 1 with a small numerical value. If, at the kth iteration, it is determined
that
>
k := dk Hdk + k (dk )2 < 0,
k given by
then k is replaced by the larger value

k
k = 2 k k 2 .
|d |
(C.31)
This ensures that the denominator in (C.30) becomes positive again. Note that this increase
in k has the effect of decreasing the step size k , as is apparent from (C.30).
Second, we must take into account any deviation of the cost function from its local
quadratic approximation. Such deviations are to be expected for large step sizes k . As a
measure of the quadricity of E(w) along the chosen step length we can use the ratio
k =
2 This
2 E(wk ) E(wk + k dk )
>
k dk gk
corresponds to the substitution Hk Hk + k I, where I is the identity matrix.
(C.32)
161
This quantity is precisely 1 for a strictly quadratic function like (C.21). Therefore we can
use the following heuristic: For the k + 1st iteration
if k > 3/4,
k+1 := k /2
if k < 1/4,
k+1 := 4k
else,
k+1 := k .
In other words, if the local quadratic approximation looks good according to criterion (C.32),
then the step size can be increased (k+1 is reduced relative to k ). If this is not the case
then the step size is decreased (k+1 is made larger).
All of which leads us finally to the following algorithm (see e.g. [Moe93])
Algorithm (Scaled Conjugate Gradient)
1. Initialize the synaptic weights w with random numbers, set k = 0, = 0.001 and
d = g = E(w)/w.
2. Set = d> Hd + |d|2 . If < 0, set = 2( /d2 ) and = d> Hd. Save the
current cost function E1 = E(w).
3. Determine the step size = d> g/ and new synaptic weights w = w + d.
4. Calculate the quadricity = (E1 E(w))/( d> g). If < 1/4, restore the old
weights: w = w d, set = 4, d = g and go to 2.
5. Set k = k + 1. If > 3/4 set = /2.
6. Determine the new local gradient g = E(w)/w and the new search direction d =
g + d, whereby, if k mod nw 6= 0 then = g> Hd/(d> Hd) else = 0.
7. If E(w) is small enough stop, else go to 2.
A few remarks on this algorithm:
The integer k counts the total number of iterations. Whenever k mod nw = 0 exactly
nw weight updates have been carried out and the minimum of a truly quadratic function would have been reached. This is taken as a good stage at which to restart the
search along the negative local gradient g rather than continuing along the current
conjugate direction d. One expects that approximation errors will gradually corrupt
the determination of the conjugate directions and the fresh start is intended to
counter this.
Whenever the quadricity condition is not filled, i.e. whenever < 1/4, the last weight
update is cancelled and the search again restarted along g.
Since the Hessian only occurs in the forms d> H, and g> H, it can be determined
efficiently with the R-operator method.
Here is an excerpt from the object FFNCG class extending FFN, showing the training
method which implements scaled conjugate gradient algorithm:
162
Pro FFNCG::Train
w = [(*self.Wh)[*],(*self.Wo)[*]]
nw = n_elements(w)
g = self->gradient()
d = -g
; search direction, row vector
k = 0L
lambda = 0.001
window,12,xsize=600,ysize=400,title=FFN(scaled conjugate gradient)
wset,12
title=Training: epoch number...,xsize=250,ysize=20)
progressbar->start
eivminmax = ?
repeat begin
return
endif
d2 = total(d*d)
; d^2
dTHd = total(self->Rop(d)*d)
; d^T.H.d
delta = dTHd+lambda*d2
if delta lt 0 then begin
lambda = 2*(lambda-delta/d2)
delta = -dTHd
endif
E1 = self->cost()
; E(w)
(*self.cost_array)[k] = E1
dTg = total(d*g)
; d^T.g
alpha = -dTg/delta
dw = alpha*d
w = w+dw
*self.Wh = reform(w[0:self.LL*(self.NN+1)-1],self.LL,self.NN+1)
*self.Wo = reform(w[self.LL*(self.NN+1):*],self.MM,self.LL+1)
; E(w+dw)
E2 = self->cost()
Ddelta = -(E1-E2)/(alpha*dTg)
; quadricity
if Ddelta lt 0.25 then begin
w = w - dw
; undo change in the weights
*self.Wh = reform(w[0:self.LL*(self.NN+1)-1],self.LL,self.NN+1)
*self.Wo = reform(w[self.LL*(self.NN+1):*],self.MM,self.LL+1)
lambda = 4*lambda
; decrease step size
d = -g
; restart along gradient
end else begin
k++
if Ddelta gt 0.75 then lambda = lambda/2
g = self->gradient()
if k mod nw eq 0 then begin
beta = 0
eivs = self->eigenvalues()
eivminmax = string(min(eivs)/max(eivs),format=(F10.6))
C.3. KALMAN FILTER TRAINING
163
end else beta = total(self->Rop(g)*d)/dTHd

d = beta*d-g
plot,*self.cost_array,xrange=[0,k>100],color=0,background=FFFFFFXL,$
ytitle=cross entropy,xtitle= $
Epoch [+textoidl(min(\lambda)/max(\lambda)=)+eivminmax+]
endelse
progressbar->Update,k*100/self.iterations,text=strtrim(k,2)
endrep until k gt self.iterations
End
C.3
Kalman filter training
In this Section we apply the recursive least squares method described in Appendix A to
train the feed forward neural network of Figure 10.4. The appropriate cost function is the
quadratic function (10.13) or, more specifically, its local version (10.14).

0 n(` + 1)

wko

1
1

.. n1 (` + 1) ~
m(` + 1)
q
.

: k

>
j
nj (` + 1)

..
.

L nL (` + 1)

Figure C.2: An isolated output neuron.

We begin with consideration of the training process of an isolated neuron. Figure C.2
depicts an output neuron in the network during presentation of the ` + 1st training pair
(x(` + 1), y(` + 1)). The neuron receives its input from the hidden layer (input vector
n(` + 1)) and generates the signal
o>
ewk
mk (` + 1) = g(wko> n(` + 1)) = PM
k0 =1
n(`+1)
o> n(`+1
ewk0
k = 1 . . . M,
which is compared to the desired output y(` + 1). It is easy to show that differentiation with
respect to wko yields
g(wko> n(` + 1)) = mk (` + 1)(1 mk (` + 1))n(` + 1).

wko
(C.33)
and with respect to n,
g(wko> n(` + 1)) = mk (` + 1)(1 mk (` + 1))wko (` + 1).

n
(C.34)
164
C.3.1
Linearization
We shall drop for the time being the indices on wko , writing it simply as w. Let us call w(`)
an approximation to the desired synaptic weight vector for our isolated output neuron, one
which has been achieved so far in the training process, i.e. after presentation of the first `
training pairs. Then a linear approximation to mk (` + 1) can be obtained by expanding in
a first order Taylor series about the point w(`),
m(` + 1) g(w(`)> n(` + 1)) +
>
g(w(`)> n(` + 1))

(w w(`)).
w
With (C.34) we can then write

m(` + 1) m(`
+ 1) + m(`
+ 1)(1 m(`
+ 1))n(` + 1)> (w w(`)),
(C.35)
where m(`
+ 1) is given by
m(`
+ 1) = g(w(`)> n(` + 1)).
With the definition of the linearized input
A(` + 1) = m(`
+ 1)(1 m(`
+ 1))n(` + 1)>
(C.36)
we can write (C.35) in the form

m(` + 1) A(` + 1)w + [m(`
+ 1) A(` + 1)w(`)].
The term in square brackets is - to first order - the error that arises from the fact that the
neurons output signal is not simply proportional to w. If we neglect it altogether, then we
get the linearized neuron output signal
m(` + 1) = A(`)w.
In order to calculate the synaptic weight vector w, we can now apply the theory of recursive
least squares developed in Appendix A. We simply identify the parameter vector a with the
synaptic weight vector w. We then have the least squares problem

A`
y`
=
w + .
y(` + 1)
A(` + 1)
The Kalman filter equations for the recursive solution of this problem, Eq. (A.15), are
unchanged:

`+1 = I K`+1 A(` + 1) `
1

K`+1 = ` A(` + 1)> A(` + 1)` A(` + 1)> + 1 .
while the recursive expression (A.14) for the parameter vector becomes

w(` + 1) = w(`) + K`+1 y(` + 1) A(` + 1)w(`) .
(C.37)
165
This can be improved somewhat by replacing the linear approximation to the system output
A(` + 1)w(`) by the actual output for the ` + 1st training observation, namely m(`
+ 1), so
we have

w(` + 1) = w(`) + K`+1 y(` + 1) m(`
+ 1) .
C.3.2
(C.38)
The algorithm
The recursive calculation of w is depicted in Figure C.3. The input is the current weight
vector w(`), its covariance matrix ` and the output vector of the hidden layer n(` + 1)
obtained by propagating the next input observation x(` + 1) through the network. After
determining the linearized input A(` + 1), Eq. (C.36), the Kalman gain K`+1 and the new
covariance matrix `+1 are calculated with (C.37). Finally, the weights are updated in
(C.38) to give w(` + 1) and the procedure is repeated.
n(` + 2)
n(` + 1)

? A(` + 1)K`+1
- C.37
C.36
>
6
y(` + 1)

? A(` + 2)
C.36
3

?
w(`
+ 1)
- C.38

w(`)
`+1
`
Figure C.3: Determination of the synaptic weights for an isolated neuron with the Kalman
filter.
To make our notation explicit for the output neurons, we substitute
y(`) yk (`)
w(`) wko (`)

m(`
+ 1) m
k (` + 1) = g wko> (`)n(` + 1)
>
A(` + 1) Aok (` + 1) = m
k (` + 1)(1 m
k (` + 1))n(` + 1)
K` Kok (`)
` ok (`),
for k = 1 . . . M . Then (C.38) becomes

wko (` + 1) = wko (`) + Kok (` + 1) y(` + 1) m
k (` + 1) ,
k = 1 . . . M.
(C.39)
Recalling that we wish to minimize the local quadratic cost function E(`) given by Eq.
(10.14), note that the expression in square brackets above is in fact the negative derivative
166
of E(`) with respect to the output signal of the neuron, i.e.

y(` + 1) m(`
+ 1) =
E(`)
mk (`)
so that
wko (`
+ 1) =
wko (`)
Kok (`
E(`)
+ 1)
mk (`)

.
(C.40)
m
k (`+1)
With this result, we can turn consideration to the hidden neurons, making the substitutions
w(`) wjh (`)

m(`
+ 1) n
j (` + 1) = g wjh> (`)x(` + 1)
>
A(` + 1) Ahj (` + 1) = n
j (` + 1)(1 n
j (` + 1))x(` + 1)
K` Khj (`)
` hj (`),
for j = 1 . . . L. Then, analogously to (C.40), the update equation for the weight vector of
the jth hidden neuron is

E(` + 1)
h
h
h
wj (` + 1) = wj (`) Kj (` + 1)
.
(C.41)
nj (` + 1) n j (`+1)
To obtain the partial derivative in (C.41), we differentiate the cost function (10.14)
X
mk ( + 1)
E(` + 1)
(yk (` + 1) mk (` + 1))
=
.
nj (` + 1)
nj ( + 1)
M
k=1
o
, we have
From (C.34), noting that (wko )j = Wjk
mk (` + 1)
o
= mk (` + 1)(1 mk (` + 1))Wjk
(` + 1)
nj (` + 1)
Combining the last two equations,
X
E(` + 1)
o
=
(yk (` + 1) mk (` + 1))mk (` + 1)(1 mk (` + 1))Wjk
(` + 1)
nj (` + 1)
M
k=1
which we can write more compactly as

E(` + 1)
o
= Wj
(` + 1) o (` + 1),
nj (` + 1)
(C.42)
o
where Wj
is the jth row of the output layer weight matrix, and where
o (` + 1) = (y(` + 1) m(` + 1)) m(` + 1) (1 m(` + 1)).

The correct update relation for the weights of the jth hidden neuron is therefore
o

(` + 1) o (` + 1) .
wjh (` + 1) = wjh (`) + Khj (` + 1) Wj
(C.43)
Apart from initialization of the covariance matrices hj (` = 0) and o (` = 1), the

Kalman training procedure has no adjustable parameters whatsoever. The covariance matrices are simply taken to be proportional to the corresponding identity matrices:
hj (0) = ZIh ,
ok (0) = ZIo ,
Z 1, j = 1 . . . L, k = 1 . . . M,
where I is the (N + 1) (N + 1) and I the (L + 1) (L + 1) identity matrix. We choose

Z = 100 and obtain
h
167
Algorithm (Kalman filter training)

1. Set ` = 0, hj (0) = 100 Io , j = 1 . . . L, ok (0) = 100 Ih , k = 1 . . . M and initialize the
synaptic weight matrices Wh (0) and Wo (0) with random numbers.
2. Choose a training pair (x(`+1), y(`+1)) and determine the hidden layer output vector

1

(` + 1) =
n
g Wh (`)> x(` + 1)
and with it the quantities
>
Ahj (` + 1) = n
j (` + 1)(1 n
j (` + 1))x(` + 1) ,
j = 1 . . . L,

m
k (` + 1) = g wko> (`)
n(` + 1)
>
Aok (` + 1) = m
k (` + 1)(1 m
k (` + 1))
n(` + 1) ,
k = 1...M
and
+ 1)) m(`
+ 1) (1 m(`
+ 1)).
o (` + 1) = (y(` + 1) m(`
3. Determine the Kalman gains for all of the neurons according to
1
>
Aok (` + 1)ok (`)Aok (` + 1) + 1 ,
1
>
>
Khk (` + 1) = hj (`)Ahj (` + 1) Ahj (` + 1)hj (`)Ahj (` + 1) + 1 ,
Kok (` + 1) = ok (`)Aok (` + 1)
>
k = 1...M
j = 1...L
4. Update the synaptic weight matrices:

wko (` + 1) = wko (`) + Kok (` + 1)[yk (` + 1) m
k (` + 1)], k = 1 . . . M
o

h
h
h
wj (` + 1) = wj (`) + Kj (` + 1) Wj (` + 1) o (` + 1) , j = 1 . . . L
5. Determine the new covariance matrices:

ok (` + 1) = Io Kok (` + 1)Aok (` + 1) ok (`),

hj (` + 1) = Ih Khj (` + 1)Ahj (` + 1) hj (`),
k = 1...M
j = 1...L
6. If the overall cost function (10.13) is sufficiently small, stop, else set ` = ` + 1 and go
to 2.
This method was originally suggested by Shah and Palmieri [SP90], who called it the multiple extended Kalman algorithm (MEKA). Here is an excerpt from the object FFNKAL class
extending FFN, showing the class method which implements the Kalman filter algorithm:
Pro FFNKAL:: train
; define update matrices for Wh and Wo
dWh = fltarr(self.LL,self.NN+1)
dWo = fltarr(self.MM,self.LL+1)
iter = 0L
iter100 = 0L
title=Training: exemplar number...,xsize=250,ysize=20)
168
;
;
;
;
;
;
;
;
;
;
;
;
;
;
progressbar->start
window,12,xsize=600,ysize=400,title=FFF(Kalman filter)
wset,12
repeat begin
return
endif
select exemplar pair at random
ell = long(self.p*randomu(seed))
x=(*self.Xs)[ell,*]
y=(*self.Ys)[ell,*]
send it through the network
m=self->forwardPass(x)
error at output
e=y-m
loop over the output neurons
for k=0,self.MM-1 do begin
linearized input (column vector)
Ao = m[k]*(1-m[k])*(*self.N)
Kalman gain
So = (*self.So)[*,*,k]
SA = So##Ao
Ko = SA/((transpose(Ao)##SA)[0]+1)
determine delta for this neuron
dWo[k,*] = Ko*e[k]
update its covariance matrix
So = So - Ko##transpose(Ao)##So
(*self.So)[*,*,k] = So
endfor
update the output weights
*self.Wo = *self.Wo + dWo
backpropagated error
beta_o =e*m*(1-m)
loop over the hidden neurons
for j=0,self.LL-1 do begin
linearized input (column vector)
Ah = X*(*self.N)[j+1]*(1-(*self.N)[j+1])
Kalman gain
Sh = (*self.Sh)[*,*,j]
SA = Sh##Ah
Kh = SA/((transpose(Ah)##SA)[0]+1)
determine delta for this neuron
dWh[j,*] = Kh*((*self.Wo)[*,j+1]##beta_o)[0]
update its covariance matrix
Sh = Sh - Kh##transpose(Ah)##Sh
(*self.Sh)[*,*,j] = Sh
endfor
update the hidden weights
169
*self.Wh = *self.Wh + dWh

; record cost history
if iter mod 100 eq 0 then begin
(*self.cost_array)[iter100]=alog10(self->cost())
iter100 = iter100+1
progressbar->Update,iter*100/self.iterations,text=strtrim(iter,2)
plot,*self.cost_array,xrange=[0,iter100],color=0,background=FFFFFFXL,$
xtitle=Iterations/100),ytitle=log(cross entropy)
end
iter=iter+1
endrep until iter eq self.iterations
End
170
Appendix D
ENVI Extensions
D.1
Installation
To install the complete extension package:

1. Place the file cursor motion.pro in your save add directory. In FilePreferencesUser
Defined Motion Routine in the ENVI main menu enter: cursor motion.
2. Place the remaining .PRO files anywhere in your IDL !PATH.
3. Place the files madviewhelp.pdf and aboutmadview.pdf in your IDL !HELP PATH.
4. Under Preferences in the ENVI main menu select the tab User Defined Files.
Under Envi Menu File and Display Menu File enter the paths to the two menu
files envi.men and display.me provided in the package.
171
172
D.2
D.2.1
APPENDIX D. ENVI EXTENSIONS
Topographic modelling
Calculating building heights
CALC HEIGHT is an ENVI extension to determine height of vertical buildings in QuickBird/Ikonos images using rational function models (RFMs) provided with ortho-ready imagery. It is invoked as
Tools/Building Height
from the ENVI display menu.
Usage
Load an RFM file in the CalcHeight window with File/Load RPC File (extension RPC
or RPB). If a DEM is available for the scene, this can also be loaded with File/Load DEM
File. A DEM is not required, however. Click on the bottom of a vertical structure to set
the base height and then shift-click on the top of the structure. Press the CALC button
to display the structures height, latitude, longitude and base elevation. The number in
brackets next to the height is the minimum distance (in pixels) between the top pixel and a
vertical line through the bottom pixel. It should be of the order of 1 or less.
If no DEM is loaded, the base elevation is the average value for the whole scene. If
a DEM is used, the base elevation is taken from it. The latitude and longitude are then
orthorectified values.
Source headers
;+
; NAME:
;
CALCHEIGHT
; PURPOSE:
;
Determine height (and lat, long, elevation) of vertical buildings
;
in QuickBird/Ikonos images using RPCs
; AUTHOR;
;
Mort Canty (2004)
;
;
; CALLING SEQUENCE:
;
CalcHeight
; ARGUMENTS:
;
Event (if used as a plug-in menu item)
; KEYWORDS:
None
;
; COMMON BLOCKS:
;
Shared, RPC, Cb, Rb, Ct, Rt, elev
;
Cursor_Motion_C, dn, Cbtext, Rbtext, Cttext, Rttext
;
RPC: structure with RPC camera model
;
Cb, Rb: coordinates of building base
;
Ct, Rt: coordinates of building top
;
elev: elevation of base
;
dn: display number
D.2. TOPOGRAPHIC MODELLING
173
;
Cbtext ... : Edit widgets
; DEPENDENCIES:
;
ENVI
;
CURSOR_MOTION
; -------------------------------------------------------------
;+
; NAME:
;
CURSOR_MOTION
; PURPOSE:
;
Cursor communication with ENVI image windows
; AUTHOR;
;
Mort Canty (2004)
;
;
; CALLING SEQUENCE:
;
Cursor_Motion, dn, xloc, yloc, xstart=xstart, ystart=ystart, event=event
; ARGUMENTS:
;
dn: display number
;
xloc,yloc: mouse position
; KEYWORDS
;
xstart, ystart: display origin
;
event: mouse event
; COMMON BLOCKS:
;
Cursor_Motion_C, dn, Cbtext, Rbtext, Cttext, Rttext
; DEPENDENCIES:
;
None
;--------------------------------------------------------------------------
D.2.2
Illumination correction
C CORRECTION is an ENVI extension for local illumination correction for multispectral images. It is invoked from the ENVI main menu as
Topographic/Illumination Correction.
Usage
From the Choose image for correction menu select the (spectral/spatial subset of the)
image to be corrected. Then in the C-correction parameters box enter the solar elevation
and azimuth in degrees and, if desired, a new size for the kernel used for slope/aspect
determination (default 99). In the Choose digital elevation file window select the
corresponding DEM file. Finally in the Output corrected image box choose an output file
name or select memory.
174
Source headers
;+
; NAME:
;
C_CORRECTION
; PURPOSE:
;
ENVI extension for c-correction for solar illumination in rough terrain
;
Ref: D. Riano et al. IEEE Transactions on
;
Geoscience and Remote Sensing, 41(5) 2003, 1056-1061
; AUTHOR
;
Mort Canty (2004)
;
;
; CALLING SEQUENCE:
;
C_Correction
; ARGUMENTS:
;
; KEYWORDS:
;
None
; DEPENDENCIES:
;
ENVI
;------------------------------------------------------------------------
D.3. IMAGE REGISTRATION
D.3
175
Image registration
CONTOUR MATCH is an ENVI extension for determination of ground control points (GCPs) for
image-image registration. It is invoked from the ENVI main menu as
Map/Registration/Contour Matching.
Usage
In the Choose base image band window enter a (spatial subset) of the base image. Then
in the Choose warp image band window select the image to be warped. In the LoG sigma
box choose the size of the Laplacian of Gaussian filter kernel. The default is 25 ( = 2.5).
Finally in the Save GCPs to ASCII menu enter a file name (extension .pts) for the CGPs.
After the calculation, these can then be loaded and inspected in the usual ENVI image-image
registration dialog.
Source headers
;+
; NAME:
;
CONTOUR_MATCH
; PURPOSE:
;
ENVI extension for extraction of ground control points for image-image registration
;
Images may be already georeferenced, in which case GCPs are for "fine adjustement"
;
Uses Laplacian of Gaussian filter and contour tracing to match closed contours
;
Ref: Li et al, IEEE Transactions on Image Processing, 4(3) (1995) 320-334
; AUTHOR
Mort Canty (2004)
;
;
;
; CALLING SEQUENCE:
;
Contour_Match
; ARGUMENTS:
;
; KEYWORDS:
;
None
; DEPENDENCIES:
;
ENVI
;
CI_DEFINE
;
PROGRESSBAR_DEFINE (FSC_COLOR)
;-----------------------------------------------------------------------;+
; NAME:
;
CI__DEFINE
; PURPOSE:
;
Find thin closed contours in an image band with combined Sobel-LoG filtering
;
Ref: Li et al, IEEE Transactions on Image Processing, 4(3) (1995) 320-334
; AUTHOR
;
Mort Canty (2004)
176
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;

CALLING SEQUENCE:
ci = Obj_New("CI",image,sigma)
ARGUMENTS:
image: grayscale image band to be searched for contours
sigma: Gaussian radius for LoG filter
KEYWORDS
None
METHODS:
GET_MAX_CONTOURS
return maximum number of closed contours (8000)
GET_MAX_LENGTH
return maximum contour length (200)
GET_CONTOUR_IMAGE
return contour (filtered) image
CLEAR_CONTOUR_IMAGE erase contour image
TO_PIXELS
read contour structure and return its pixel array
TO_FILTERED_CC
read contour structure and return filtered chain code
TO_MOMENTS
read contour structure and return Hu invariant moments
WRITE_CONTOUR
read contour stucture and display on contour_image
TRACE_CONTOUR
search contour image for next closed contour
DEPENDENCIES:
None
Contour structure
c = { sp:
intarr(2),
$
; starting point coordinates
length: 0L,
$
; number of pixels
closed: -1L,
$
; set to zero while tracing, 1 when
closed, -1 when no more contours
code:
bytarr(self.max_length),
$
; chain code
icode:
bytarr(self.max_length)
}
; rotationally invariant chain code
---------------------------------------------------------------------
D.4. IMAGE FUSION
D.4
D.4.1
177
Image fusion
DWT fusion
ARSIS DWT is an ENVI extension for panchromatic sharpening with the discrete wavelet
transform (DWT). It is invoked from the ENVI main menu as
Transform/Image Sharpening/Wavelet(ARSIS Model)/DWT
Usage
In the Select low resolution multi-band input file window choose the (spatial/spectral
subset of the) image to be sharpened. In the Select hi res input band window choose the
corresponding panchromatic or high resolution image. Then in the ARSIS Fusion Output
box select an output file name or memory.
Source headers
;+
; NAME:
;
ARSIS_DWT
; PURPOSE:
;
ENVI extension for panchromatic sharpening under ARSIS model
;
with Mallats discrete wavelet transform and Daubechies wavelets
;
Ref: Ranchin and Wald, Photogramm. Eng. Remote. Sens.
;
66(1), 2000, 49-61
; AUTHOR
;
Mort Canty (2004)
;
;
; CALLING SEQUENCE:
;
ARSIS_DWT
; ARGUMENTS:
;
; KEYWORDS:
;
None
; DEPENDENCIES:
;
ENVI
;
DWT__DEFINE(PHASE_CORR)
;
ORTHO_REGRESS
;-----------------------------------------------------------------------;+
; NAME:
;
DWT__DEFINE
; PURPOSE:
;
Discrete wavelet transform class using Daubechies wavelets
;
for construction of pyramid representations of images, fusion etc.
;
Ref: T. Ranchin, L. Wald, Photogammetric Engineering and
;
Remote Sensing 66(1) (2000) 49-61.
178
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
AUTHOR
Mort Canty (2004)
CALLING SEQUENCE:
dwt = Obj_New("DWT",image)
ARGUMENTS:
image: grayscale image to be compressed
KEYWORDS
None
METHODS:
SET_COEFF: choose the Daubechies wavelet
dwt -> Set_Coeff, n
n = 4,6,8,12
SHOW_IMAGE: display the image pyramid in a window
dwt -> Show_Image, wn
INJECT: overwrite upper left quadrant
after phase correlation match if keyword pc is set (default)
dwt -> Inject, array, pc = pc
SET_COMPRESSIONS: set the number of compressions
dwt -> Set_Compressions, nc
GET_COMPRESSIONS: get the number of compressions
nc = dwt -> Get_Compressions()
GET_NUM_COLS: get the number of columns in the compressed image
cols = dwt -> Get_Num_Cols()
GET_NUM_ROWS: get the number of rows in the compressed image
cols = dwt -> Get_Num_Rows()
GET_IMAGE: return the pyramid image
im = dwt -> Get_Image()
GET_QUADRANT: get compressed image (as 2D array) or innermost
wavelet coefficients as vector
wc = dwt -> Get_Quadrant(n)
n = 0,1,2,3
NORMALIZE_WC: normalize wavelet coefficients at all levels
dwt -> Normalize, a, b
a, b are normalization parameters
COMPRESS: perform a single compression
dwt -> Compress
dwt -> Inject, array, pcr=pc
EXPAND: perfrom a single expansion
dwt -> Expand
DEPENDENCIES:
PHASE_CORR
---------------------------------------------------------------------
;+
; NAME:
;
PHASE_CORR
; PURPOSE:
;
Returns relative offset [xoff,yoff] of two images using phase correlation
D.4. IMAGE FUSION
179
;
Maximum offset should not exceed +- 5 pixels in each dimension
;
Returns -1 if dimensions are not equal
;
Ref: H, Shekarforoush et al. INRIA 2707
; AUTHOR
;
Mort Canty (2004)
;
;
; CALLING SEQUENCE:
;
shft = Phase_Corr(im1,im2,display=display,subpixel=subpixel)
; ARGUMENTS:
;
im1, im2: the images to be correlated
; KEYWORDS:
;
Display: (optional) show a surface plot if the correlation
;
in window with display number display
;
Subpixel: returns result to subpixel accuracy if set,
;
otherwise nearest integer (default)
; DEPENDENCIES:
;
None
;---------------------------------------------------------------------------
; NAME:
;
ORTHO_REGRESS
; PURPOSE:
;
Orthogonal regression between two vectors
;
Ref: M. Canty et al. Remote Sensing of Environment 91(3,4) (2004) 441-451
; AUTHOR
;
Mort Canty (2004)
;
;
; CALLING SEQUENCE:
;
Ortho_Regress, X, Y, a, Xm, Ym, sigma_a, sigma_b
;
regression line is Y = Ym + a(X-Xm) = (Ym-aXm) + aX = b + aX
; ARGUMENTS:
;
input column vectors X and Y
;
returns a, Xm, Ym, sigma_a, sigma_b
; KEYWORDS:
;
None
; DEPENDENCIES:
;
None
;-------------------------------------------------------------------
D.4.2
ATWT fusion
ARSIS ATWT is an ENVI extension for panchromatic sharpening with the `

a trous wavelet
transform (ATWT). It is invoked from the ENVI main menu as
Transform/Image Sharpening/Wavelet(ARSIS Model)/ATWT
180
Usage
In the Select low resolution multi-band input file window choose the (spatial/spectral
subset of the) image to be sharpened. In the Select hi res input band window choose the
corresponding panchromatic or high resolution image. Then in the ARSIS Fusion Output
box select an output file name or memory.
Source headers
;+
; NAME:
;
ARSIS_ATWT
; PURPOSE:
;
ENVI extension for panchromatic sharpening under ARSIS model
;
with "A trous" wavelet transform.
;
Ref: Aiazzi et al, IEEE Transactions on Geoscience and
;
Remote Sensing, 40(10) 2300-2312, 2002
; AUTHOR
;
Mort Canty (2004)
;
;
; CALLING SEQUENCE:
;
ARSIS_ATWT
; ARGUMENTS:
;
; KEYWORDS:
;
None
; DEPENDENCIES:
;
ENVI
;
ATWT__DEFINE(WARP_SHIFT, PHASE_CORR)
;
ORTHO_REGRESS
;-----------------------------------------------------------------------;+
; NAME:
;
ATWT__DEFINE
; PURPOSE:
;
A Trous wavelet transform class using Daubechies wavelets.
;
Used for shift invariant image fusion
;
Ref: Aiazzi et al. IEEE Transactions on Geoscience and
;
Remote Sensing 40(10) (2002) 2300-2312
; AUTHOR
;
Mort Canty (2004)
;
;
; CALLING SEQUENCE:
;
atwt = Obj_New("ATWT",image)
; ARGUMENTS:
;
image: grayscale image to be processed
; KEYWORDS
D.4. IMAGE FUSION
181
;
None
; METHODS:
;
SHOW_IMAGE: display the image pyramid in a window
;
dwt -> Show_Image, wn
;
INJECT: overwrite the filtered image
;
dwt -> Inject, im
;
SET_TRANSFORMS: set the number of transformations
;
dwt -> Set_Transforms, nc
;
GET_TRANSFORMS: get the number of transformations
;
nc = dwt -> Get_Transforms()
;
GET_NUM_COLS: get the number of columns in the compressed image
;
cols = dwt -> Get_Num_Cols()
;
GET_NUM_ROWS: get the number of rows in the compressed image
;
cols = dwt -> Get_Num_Rows()
;
GET_IMAGE: return filtered image or details
im = dwt -> Get_Image(i)
;
;
i = 0 for filters image, i > 0 for details
;
NORMALIZE_WC: normalize details at all levels
;
dwt -> Normalize, a, b
;
a, b are normalization parameters
;
COMPRESS: perform a single transformation
;
dwt -> Compress
;
EXPAND: perfrom a single reverse transformation
;
dwt -> Expand
; DEPENDENCIES:
;
WARP_SHIFT
;
PHASE_CORR
; ---------------------------------------------------------------------
;+
; NAME:
;
WARP_SHIFT
; PURPOSE:
;
Use RST with bilinear interpolation to shift band to sub-pixel accuracy
; AUTHOR
;
Mort Canty (2004)
;
;
; CALLING SEQUENCE:
;
sband = Warp_Shift(band,shft)
; ARGUMENTS:
;
band: the image band to be shifted
; KEYWORDS:
;
None
; DEPENDENCIES:
;
ENVI
;---------------------------------------------------------------------------
182
D.4.3
Quality index
RUN QUALITY INDEX is an ENVI extension to determine the Wang-Bovik quality index of a
pan-sharpened image. It is invoked from the ENVI main menu as
Transform/Image Sharpening/Quality Index
Usage
From the Choose reference image menu select the multispectral image to which the sharpened image is to be compared. In the Choose pan-sharpened image menu, select the image
whose quality is to be determined.
Source headers
;+
; NAME:
;
RUN_QUALITY_INDEX
; PURPOSE:
;
ENVI extension for radiometric comparison of two multispectral images
;
Ref: Wang and Bovik, IEEE Signal Processing Letters 9(3) 2002, 81-84
; AUTHOR
;
Mort Canty (2004)
;
;
; CALLING SEQUENCE:
;
Run_Quality_Index
; ARGUMENTS:
;
Run_Quality_Index, Event (if used as a plug-in menu item)
; KEYWORDS:
;
None
; DEPENDENCIES:
;
ENVI
;
QI
;-------------------------------------------------------------------------;+
; NAME:
;
QI
; PURPOSE:
;
Determine the Wang-Bovik quality index for a pan-sharpened image band
;
Ref: Wang and Bovik, IEEE Signal Processing Letters 9(3) 2002, 81-84
; AUTHOR
;
Mort Canty (2004)
;
;
; CALLING SEQUENCE:
;
index = QI(band1,band2)
; ARGUMENTS:
;
band1: reference band
D.4. IMAGE FUSION

;
band2: degraded pan-sharpened band
; KEYWORDS:
;
None
; DEPENDENCIES:
;
None
;----------------------------------------------------------------------
183
184
D.5
D.5.1
Change detection
Multivariate Alteration Detecton
MAD RUN is an ENVI extension for change detection with the MAD transformation. It is
invoked from the ENVI main menu as
Basic Tools/Change Detection/MAD
Usage
From the Choose first image window enter the first (spatial/spectral subset) of the two
image files. In the Choose second image window enter the second image file name. The
spatial and spectral subsets must be identical. If an input image is in BSQ format, it is
converted in place, after a warning, to BIP. In the MAD Output box choose a file name or
memory. The calculation begins and can be interrupted at any time with the Cancel button.
Before output, the spatial subset for the final MAD transformation can be changed, e.g.
extended to a full scene, if desired.
Source headers
;+
; NAME:
;
MAD_RUN
; PURPOSE:
;
ENVI extension for Multivariate Alteration Detection.
;
Ref: A. A. Nielsen et al. Remote Sensing of Environment 64 (1998), 1-19
;
Uses spectral tiling and therefore suitable for large datasets.
;
Reads in two registered multispectral images (spectral/spatial subsets
;
must have the same dimensions, spectral subset size must be at least 2).
;
If an input image is in BSQ format, it is converted in place to BIP.
;
Writes the MAD variates to disk.
; AUTHOR
;
Mort Canty (2004)
;
;
; CALLING SEQUENCE:
;
Mad_Run
; ARGUMENTS:
;
; KEYWORDS:
;
None
; DEPENDENCIES:
;
ENVI
;
MAD_TILED (COVPM_DEFINE, GEN_EIGENPROBLEM)
;
;--------------------------------------------------------------------;+
; NAME:
D.5. CHANGE DETECTION
185
;
MAD_TILED
; PURPOSE:
;
Function for Multivariate Alteration Detection.
;
;
Uses spectral tiling and therefore suitable for large datasets.
;
Input files must be BIL or BIP format.
;
On error or if interrupted during the first iteration, returns = -1 else 0
; AUTHOR
;
Mort Canty (2004)
;
;
; CALLING SEQUENCE:
;
result = Mad_Tiled(fid1,fid2,dims1,dims2,pos1,pos2)
; ARGUMENTS:
fid1, fid2
input file specifications
;
;
dims1, dims2
;
pos1, pos2
; KEYWORDS:
;
A, B
output: transformation eigenvectors
;
means1, means2
weighted mean values for transformation, row-replicated
;
cp
change probability image from chi-square distribution
; DEPENDENCIES:
;
ENVI
;
COVPM_DEFINE
;
GEN_EIGENPROBLEM
;
;--------------------------------------------------------------------;+
; NAME:
;
COVPM__DEFINE
; PURPOSE:
;
Object class for iterative covariance matrix calculation
;
using the method of provisional means.
; AUTHOR
;
Mort Canty (2004)
;
;
; CALLING SEQUENCE:
;
covpm = Obj_New("COVPM")
; ARGUMENTS:
;
None
; KEYWORDS
;
None
; METHODS:
;
UPDATE: update the covariance matrix with an observation
;
covpm -> Update, v, weight = w
;
v is an obsevation vector (array)
;
w is an optioanl weight for that observation
;
COVARIANCE: read out the covariance matrix
186
;
cov = covpm -> Covariance()
;
MEANS: read out the observation means
;
mns = covpm -> Means()
; DEPENDENCIES:
;
None
;-------------------------------------------------------------;+
; NAME:
GEN_EIGENPROBLEM
;
; PURPOSE:
;
Solve the generalized eigenproblem
;
C##a = lambda*B##a
;
using Cholesky factorization
; AUTHOR:
;
Mort Canty (2001)
;
;
; CALLING SEQUENCE:
;
Gen_Eigenproblem, C, B, A, lambda
; ARGUMENTS:
;
C and B are real, square, symmetric matrices
;
returns the eigenvalues in the row vector lambda
;
returns the eigenvectors a as the columns of A
; KEYWORDS:
;
None
; DEPENDENCIES
;
None
;---------------------------------------------------------------------
D.5.2
Maximum autocorrelation factor
MAF is an ENVI extension for performing the MAF transformation, usually on previously
calculated MAD variates. It is invoked from the ENVI main menu as
Basic Tools/Change Detection/MAF (of MAD)
Usage
In the Choose multispectral image box select the file to be transformed. In the MAF
Output box select an output file name or memory.
Source headers
;+
; NAME:
;
MAF
; PURPOSE:
;
ENVI extension for Maximum Autocorrelation Fraction transformation.
;
Ref: Green et al, IEEE Transaction on Geoscience and Remote Sensing,
D.5. CHANGE DETECTION
187
;
26(1):65-74,1988
; AUTHOR
;
Mort Canty (2004)
;
;
; CALLING SEQUENCE:
;
Maf
; ARGUMENTS:
;
; KEYWORDS:
;
None
; DEPENDENCIES:
;
ENVI
;
GEN_EIGENPROBLEM
;---------------------------------------------------------------------
D.5.3
Radiometric normalization
RADCAL is an ENVI extension for radiometric normalization with the MAD transformation.
It is invoked from the ENVI main menu as
Basic Tools/Change Detection/MAD Radiometric Normalization
Usage
From the Choose reference image window enter the first (spatial/spectral subset) of the
two image files. In the Choose target image window enter the second image file name.
The spatial and spectral subsets must be identical. If an input image is in BSQ format, it
is converted in place, after a warning, to BIP. In the MAD Output box choose a file name or
memory. The calculation begins and can be interrupted at any time with the Cancel button.
In a series of plot windows the regression lines used for the normalization are plotted. The
results can be then used to calibrate another file, e.g. a full scene.
Source headers
;+
; NAME:
;
RADCAL
; PURPOSE:
;
Radiometric calibration using MAD
;
Ref: M. Canty et al. Remote Sensing of Environment 91(3,4) (2004) 441-451
;
Reference and target images must have equal spatial and spectral dimensions,
;
at least 2 spectral components, and be registered to one another.
;
Once the regression coefficients have been determined, they can be used to
;
calibrate another file, for example a full scene, which need not be registered
;
to the reference image.
; AUTHOR
;
Mort Canty (2004)
;
;
188
; CALLING SEQUENCE:
;
Radcal
; ARGUMENTS:
;
; KEYWORDS:
;
None
; DEPENDENCIES:
;
ENVI
;
ORTHO_REGRESS
;
MAD_TILED (COVPM_DEFINE, GEN_EIGENPROBLEM)
;
;-----------------------------------------------------------------
D.6. UNSUPERVISED CLASSIFICATION
D.6
D.6.1
189
Unsupervised classification
Hierarchical clustering
HCLRUN is an ENVI extension for agglomerative hierarchical clustering. It is invoked from

the ENVI main menu as
Classification/Unsupervised/Hierarchic
It is intended as a demonstration, and writes to memory only.
Usage
In the Choose multispectral image for clustering window select the (spatial/spectral
subset of the) desired image. In the Number of Samples box choose the size of the representative random sample (default 1000). In the Number of Classes box select the desired
number of clusters.
Source headers
;+
; NAME:
;
HCLRUN
; PURPOSE:
;
ENVI extension for hierarchical agglomerative clustering
; AUTHOR
;
Mort Canty (2004)
;
;
; CALLING SEQUENCE:
;
HCLrun
; ARGUMENTS:
;
; KEYWORDS:
;
None
; DEPENDENCIES:
;
ENVI
;
HCL (PROGRESSBAR__DEFINE (FSC_COLOR))
;
CLASS_LOOKUP_TABLE
;--------------------------------------------------------------------;+
; NAME:
;
HCL
; PURPOSE:
;
Agglomerative hierarchic clustering with sum of squares cost function.
;
Takes data array Xs (column vectors) and number of clusters K as input.
;
Returns cluster memberships Cs.
;
Ref. Fraley Technical Report 311, Dept. of Statistics,
;
University of Washington, Seattle (1996).
; AUTHOR
190
;
Mort Canty (2004)
;
;
; CALLING SEQUENCE:
;
HCL, Xs, K, Cs
; ARGUMENTS:
;
Xs: input observations array (column vectors)
K: number of clusters
;
;
Cs: Cluster labels of observations
; KEYWORDS:
;
None
; DEPENDENCIES:
PROGRESSBAR__DEFINE (FSC_COLOR)
;
;-------------------------------------------------------------------;+
; NAME:
;
CLASS_LOOKUP_TABLE
; PURPOSE:
;
Provide 16 class colors for supervised and unsupervised classification programs
; AUTHOR;
;
Mort Canty (2004)
;
;
; CALLING SEQUENCE:
;
colors = Class_Lookup_Table(Ptr)
; ARGUMENTS:
;
Ptr: a vector of pointers into the table
; KEYWORDS:
;
None
; DEPENDENCIES:
;
None
;---------------------------------------------------------------------
D.6.2
Fuzzy K-means clustering
SAMPLE FKMRUN is an ENVI extension for fuzzy k-means clustering. It is invoked from the
ENVI main menu as
Classification/Unsupervised/Fuzzy-K-Means
Usage
In the Choose multispectral image window select the (spatial/spectral subset of the)
desired image. In the Number of Classes box select the desired number of clusters. In the
FKM Output box select the output file name or memory.
Source headers
;+
191
; NAME:
;
SAMPLE_FKMRUN
; PURPOSE:
;
ENVI extension for fuzzy K-means clustering with sampled data
; AUTHOR
;
Mort Canty (2004)
;
;
; CALLING SEQUENCE:
;
Sample_FKMrun
; ARGUMENTS:
;
; KEYWORDS:
;
None
; DEPENDENCIES:
;
ENVI
;
FKM (PROGRESSBAR_DEFINE (FSC_COLOR))
;
CLUSTER_FKM
;
CLASS_LOOKUP_TABLE
;--------------------------------------------------------------------;+
; NAME:
;
FKM
; PURPOSE:
;
Fuzzy Kmeans clustering algorithm.
;
Takes data array Xs (data as column vectors), number of clusters K.
;
Returns fuzzy membership matrix U and the class centers Ms.
;
Ref: J. C. Dunn, Journal of Cybernetics, PAM1:32-57, 1973
; AUTHOR
;
Mort Canty (2004)
;
;
; CALLING SEQUENCE:
;
FKM, Xs, K, U, Ms, niter=niter, seed=seed
; ARGUMENTS:
;
;
K: number of clusters
;
U: final class probability membership matrix (output)
;
Ms: cluster means (output)
; KEYWORDS:
;
niter: number of iterations (optional)
;
seed: initial random number seed (optional)
; DEPENDENCIES:
;
;-------------------------------------------------------------------;+
; NAME:
;
CLUSTER_FKM
192
; PURPOSE:
;
Modified distance clusterer from IDL library
; CALLING SEQUENCE:
;
labels = Cluster_fkm(Array,Weights,Double=Double,N_clusters=N_clusters)
;-------------------------------------------------------------------------
D.6.3
EM clustering
SAMPLE EMRUN is an ENVI extension for EM clustering. It is invoked from the ENVI main
menu as
Classification/Unsupervised/EM(Sampled)
TILED EMRUN can be used to cluster large data sets. It is invoked from the ENVI main menu
as
Classification/Unsupervised/EM(Tiled)
Usage
In the Choose multispectral image for clustering window select the (spatial/spectral
subset of the) desired image. In the Number of Samples box choose the size of the representative random sample (default 1000). In the Number of Classes box select the desired
number of clusters. In the FKM Output box select the output file name or memory. In
the Output class membership probs box select the output file name for the probabilities
(rule) image, or Cancel if this is not desired. The rule image will be byte coded (0 = probability 0, 255 = probability 1). In the tiled version, output to memory is not possible. During
calculation a log likelihood plot is shown. Calculation can be interrupted at any time.
Source headers
;+
; NAME:
;
SAMPLE_EMRUN
; PURPOSE:
;
ENVI extension for EM clustering with sampled data
; AUTHOR
;
Mort Canty (2004)
;
;
; CALLING SEQUENCE:
;
Sample_EMrun
; ARGUMENTS:
;
; KEYWORDS:
;
None
; DEPENDENCIES:
;
ENVI
;
EM (PROGRESSBAR__DEFINE (FSC_COLOR))
;
CLUSTER_EM
;
CLASS_LOOKUP_TABLE
193
;--------------------------------------------------------------------;+
; NAME:
;
TILED_EMRUN
; PURPOSE:
;
ENVI extension for EM clustering on sampled data, large data sets
; AUTHOR
;
Mort Canty (2004)
;
;
; CALLING SEQUENCE:
;
Tiled_EMrun
; ARGUMENTS:
;
; KEYWORDS:
;
None
; DEPENDENCIES:
;
ENVI
;
EM (PROGRESSBAR__DEFINE (FSC_COLOR))
;
FKM
;
CLUSTER_EM
;
CLASS_LOOKUP_TABLE
;--------------------------------------------------------------------;+
; NAME:
;
EM
; PURPOSE:
;
Expectation maximization clustering algorithm for Gaussian mixtures.
;
Takes data array Xs (data as column vectors) and initial
;
class membership probability matrix U as input.
;
Returns U, the class centers Ms, Priors Ps and final
;
class covariances Fs.
;
Allows for simulated annealing
;
Ref: Gath and Geva, IEEE Trans. Pattern Anal. and Mach.
;
Intell. 3(3):773-781, 1989
;
Hilger, Exploratory Analysis of Multivariate Data,
;
PhD Thesis, IMM, Technical University of Denmark, 2001
; AUTHOR
;
Mort Canty (2005)
;
;
; CALLING SEQUENCE:
;
Pro EM, Xs, U, Ms, Ps, Fs, unfrozen=unfrozen, wnd=wnd, $
;
maxiter=maxiter, miniter=miniter, verbos=verbose, $
;
pdens=pdens, pd_exclude=pdens_exclude, fhv=fhv, T0=T0
; ARGUMENTS:
;
;
U: initial class probability membership matrix (column vectors)
194
;
Ms: cluster means (output)
;
Ps: cluster priors (output)
;
Fs: cluster covariance matrices (output)
; KEYWORDS:
;
unfrozen: Indices of the observations which
;
take part in the iteration (default all)
;
wnd: window for displaying the log likelihood (optional)
;
maxinter: maximum iterations (optional)
;
minimter: minimum iterations (optional)
;
pdens: partition density (output, optional)
;
pd_exclude: array of classes to be excluded from pdens and fhv (optional)
;
fhv: fuzzy hypervolume (output, optional)
;
T0: initial annealing temperature (default 1.0)
;
verbose: set to print output info to IDL log
; DEPENDENCIES:
;
;-------------------------------------------------------------------;+
; NAME:
;
CLUSTER_EM
; PURPOSE:
;
Cluster data after running the EM algorithm
;
Takes data array (as row vectors), means Ms (as row vectors), priors Ps
;
and covariance matrices Fs and returns the class labels.
;
Class membership probabilities are returned in class_probs
; AUTHOR
;
Mort Canty (2004)
;
;
; CALLING SEQUENCE:
labels = Cluster_EM(Xs,Ms,Ps,Fs,class_probs=class_probs,progress_bar=progress_bar)
;
; ARGUMENTS:
;
Xs: data array
;
Ms: cluster means
;
Ps: cluster priors
;
Fs: cluster covariance matrices
; KEYWORDS:
;
class_probs (optional): contains cluster membership probability image
;
progress_bar: set to 0 if no progressbar is desired
; DEPENDENCIES:
;
;--------------------------------------------------------------------
D.6.4
Probabilistic label relaxation
PLR is an ENVI extension for performing probabilistic relaxation on rule (class membership
probability) images generated by supervised and unsupervised classification algorithms. It
is invoked from the ENVI main menu as
195
Classification/Post Classification/Probabilistic Label Relaxation/Run PLR

Usage
In the Choose probabilities image window select the rule image to be processed. Select
the number of iterations (default 3) in the Number of iterations box. Finally choose a
file name for the output rule image in the PLR output box.
PLR RECLASS is an ENVI extension for (re)classifying on rule (class membership probability) images generated by supervised and unsupervised classification algorithms and probabilistic label relaxation. It is invoked from the ENVI main menu as
Classification/Post Classification/Probabilistic Label Relaxation/Reclassify
Usage
In the Choose classification file optionally specify a previous classification image.
This determines the color coding of the reclassification output. If cancelled, a default
color code is used. In the Choose probabilities window, select the file to be processed.
Answer the Include dummy unclassified class query with Yes. In the Output PLR
reclassification box choose an output file name or memory.
Source headers
;+
; NAME:
;
PLR
; PURPOSE:
ENVI extension for postclassification with
;
;
Probabilistic Label Relaxation
;
Ref. Richards and Jia, Remote Sensing Digital Image Analysis (1999) Springer
;
Processes a rule image (class membership probabilities), outputs a
;
new rule image
; AUTHOR;
;
Mort Canty (2004)
;
;
; CALLING SEQUENCE:
Plr
;
; ARGUMENTS
;
; KEYWORDS:
;
None
; DEPENDENCIES:
;
ENVI
;
;---------------------------------------------------------------------;+
; NAME:
;
PLR_RECLASS
196
; PURPOSE:
;
ENVI extension for postclassification with
;
Probabilistic Label Relaxation
;
Ref. Richards and Jia, Remote Sensing Digital Image Analysis (1999) Springer
;
Processes a rule image (class membership probabilities), outputs a
;
new classification file
; AUTHOR;
;
Mort Canty (2004)
;
;
; CALLING SEQUENCE:
;
Plr_Reclass
; ARGUMENTS
;
; KEYWORDS:
;
None
; DEPENDENCIES:
;
ENVI
;
;----------------------------------------------------------------------
D.6.5
Kohonen self organizing map
SAMPLE SOMRUN is an ENVI extension for clustering with the Kohonen self-organizing map.
Classification/Unsupervised/Kohonen SOM
Usage
In the Choose multispectral image window select the (spatial/spectral subset of the)
desired image. In the Cube side dimension box select the desired dimension of the cubic
neural network (default 6). In the SOM Output box select the output file name or memory.
Source headers
;+
; NAME:
;
SAMPLE_SOMRUN
; PURPOSE:
;
ENVI extension for Kohonen Self Organizing Map with sampled data
Ref. T. Kohonen, Self Organization and Associative Memory, Springer 1989.
;
; AUTHOR
;
Mort Canty (2004)
;
;
; CALLING SEQUENCE:
;
Sample_KFrun
; ARGUMENTS:
;
197
; KEYWORDS:
;
None
; DEPENDENCIES:
;
ENVI
;
;---------------------------------------------------------------------
D.6.6
A GUI for change clustering
MAD VIEW is an IDL GUI (graphical user interface) for viewing and processing MAD and
MNF/MAD change images. It is invoked from the ENVI main menu as
Basic Tools/Change Detection/MAD View
Usage
This extension is provided with an on-line help.
Source headers
;+
; NAME:
;
MAD_VIEW
; PURPOSE:
;
GUI for viewing, thresholding and clustering MAD/MNF images
;
;
A. A. Nielsen private communication (2004)
; AUTHOR
;
Mort Canty (2004)
;
;
; CALLING SEQUENCE:
;
Mad_View
; ARGUMENTS:
;
; KEYWORDS:
;
None
; DEPENDENCIES:
;
ENVI
;
EM
;
CLUSTER_EM
;
;---------------------------------------------------------------------
198
D.7
Neural network: Scaled conjugate gradient
FFNCG RUN is an ENVI extension for supervised classification with a two-layer feed forward
neural network. It uses the scaled conjugate gradient training algorithm and can be used as
a replacement for the much slower backpropagation neural network implemented in ENVI.
Classification/Supervised/Neural Net/Conjugate Gradient
Usage
In the Enter file for classification window select the (spatial/spectral subset of the)
desired image. This must be in BIP format. In the ROI selection box choose the training
regions desired. In the Output FFN classification to file box select the output file
name. In the Output FFN probabilities to file box select the output file name for the
probabilities (rule) image, or Cancel if this is not desired. The rule image will be byte
coded (0 = probability 0, 255 = probability 1). In the Number of hidden units box select
the number of neurons in the first layer (default 4). As the calculation proceeds, the cost
function is displayed in a plot window. The calculation can be interrupted with Cancel.
Source headers
;+
; NAME:
;
FFNCG_RUN
; PURPOSE:
;
ENVI extension for classification of a multispectral image
;
with a feed forward neural network using scaled conjugate gradient training
; AUTHOR;
;
Mort Canty (2004)
;
;
; CALLING SEQUENCE:
;
FfnCG_Run
; ARGUMENTS
;
; KEYWORDS:
;
None
; DEPENDENCIES:
;
ENVI
;
;
FFNCG__DEFINE (FFN__DEFINE)
;---------------------------------------------------------------------;+
; NAME:
;
FFNCG__DEFINE
; PURPOSE:
;
;
D.7. NEURAL NETWORK: SCALED CONJUGATE GRADIENT
199
;
Implements scaled conjugate gradient training.
;
Ref: C. Bishop, Neural Networks for Pattern Recognition, Oxford 1995
;
M. Canty, Fernerkundung mit neuronalen Netzen, Expert 1999
; AUTHOR
;
Mort Canty (2005)
;
;
; CALLING SEQUENCE:
;
ffn = Obj_New("FFNCG",Xs,Ys,L)
; ARGUMENTS:
;
;
;
L:
; KEYWORDS
;
None
; METHODS:
;
ROP: determine the matrix product v^t.H, where H is the Hessian of
;
the cost function wrt the weights, using the R-operator
;
r = ffn -> Rop(v)
;
HESSIAN: claculate the Hessian
;
h = ffn -> Hessian()
;
EIGENVALUES: calculate the eigenvalues of the Hessian
e = ffn -> Eigenvalues()
;
;
GRADIENT: calculate the gradient of the global cost function
;
g = ffn -> Gradient()
;
;
ffn -> train
; DEPENDENCIES:
;
FFN__DEFINE
;
PROGRESSBAR (FSC_COLOR)
;-------------------------------------------------------------;+
; NAME:
;
FFN__DEFINE
; PURPOSE:
;
;
;
This is a generic class with no training methods.
;
; AUTHOR
Mort Canty (2005)
;
;
;
; CALLING SEQUENCE:
;
ffn = Obj_New("FFN",Xs,Ys,L)
; ARGUMENTS:
;
;
Ys: array of class label column vectors of form (0,0,1,0,0,...0)^T
;
L:
200
; KEYWORDS
;
None
; METHODS (external):
;
OUTPUT: return a class membership probability vector for an observation
;
row vector x
;
p = ffn -> Output(x)
;
CLASS: return the class for an observation row vector x
;
p = ffn -> Class(x)
; DEPENDENCIES:
;
None
;--------------------------------------------------------------
D.8
Neural network: Kalman filter
FFNKAL RUN is an ENVI extension for supervised classification with a two-layer feed forward
neural network. It uses a fast Kalman Filter training algorithm and can be used as a
replacement for the much slower backpropagation neural network implemented in ENVI. It
is invoked from the ENVI main menu as
Classification/Supervised/Neural Net/Kalman Filter
Usage
probabilities (rule) image, or Cancel if this is not desired. The rule image will be byte coded
(0 = probability 0, 255 = probability 1). In the Number of hidden units box select the
number of neurons in the first layer (default 4). As the calculation proceeds, the logarithm
of the cost function is displayed in a plot window. The calculation can be interrupted with
Cancel.
Source headers
;+
; NAME:
;
FFNKAL_RUN
; PURPOSE:
;
Classification of a multispectral image with feed forward neural network
;
using Kalman filter training
; AUTHOR;
;
Mort Canty (2004)
;
;
; CALLING SEQUENCE:
;
FfnKal_Run
; ARGUMENTS
;
D.9. NEURAL NETWORK: HYBRID
201
; KEYWORDS:
;
None
; DEPENDENCIES:
;
ENVI
;
;
FFNKAL__DEFINE (FFN__DEFINE)
;---------------------------------------------------------------------;+
; NAME:
;
FFNKAL__DEFINE
; PURPOSE:
;
;
;
Implements Kalman filter training.
;
; AUTHOR
;
Mort Canty (2005)
;
;
; CALLING SEQUENCE:
;
ffnkal = Obj_New("FFNKAL",Xs,Ys,L)
; ARGUMENTS:
;
;
;
L:
; KEYWORDS
;
None
; METHODS:
;
OUTPUT (inherited): return a class membership probability vector for an observation
;
row vector x
;
p = ffnbp -> Output(x)
CLASS (inherited): return the class for an observation row vector x
;
;
p = ffnbp -> Class(x)
;
;
ffnkal -> train
; DEPENDENCIES:
;
FFN__DEFINE
;
PROGRESSBAR(FSC_COLOR)
;--------------------------------------------------------------
D.9
Neural network: Hybrid
FFN RUN is an ENVI extension for supervised classification with a two-layer feed forward
neural network. It uses both the Kalman filter and the scaled conjugate gradient training
algorithm and can be used as a replacement for the much slower backpropagation neural
network implemented in ENVI. It is invoked from the ENVI main menu as
Classification/Supervised/Neural Net/Hybrid
202
Usage
probabilities (rule) image, or Cancel if this is not desired. The rule image will be byte coded
(0 = probability 0, 255 = probability 1). In the Number of hidden units box select the
number of neurons in the first layer (default 4). As the Kalman filter calculation proceeds,
the log of the cost function is displayed in a plot window. The calculation can be interrupted
with Cancel. Then calculation continues where the Kalman filter left off with the scaled
conjugate gradient training method. The calculation can again be interrupted with Cancel.
Source headers
;+
; NAME:
;
FFN_RUN
; PURPOSE:
;
ENVI extension for classification of a multispectral image
;
with a feed forward neural network using Kalman filter
;
plus scaled conjugate gradient training
; AUTHOR;
;
Mort Canty (2004)
;
;
; CALLING SEQUENCE:
;
Ffn_Run
; ARGUMENTS
;
; KEYWORDS:
;
None
; DEPENDENCIES:
;
ENVI
;
;
FFNKAL__DEFINE (FFN__DEFINE)
;
FFNCG__DEFINE
;----------------------------------------------------------------------
Bibliography
[AABG02] B. Aiazzi, L. Alparone, S. Baronti, and A. Garzelli. Context-driven fusion of
high spatial and spectral resolution images based on oversampled multiresolution
analysis. IEEE Transactions on Geoscience and Remote Sensing, 40(10):2300
2312, 2002.
[And84]
T. W. Anderson. An Introduction to Multivariate Statistical Analysis, 2nd Edition. Wiley Series in Probability and Mathematical Statistics, 1984.
[AS99]
E. Aboufadel and S. Schlicker. Discovering Wavelets. Wiley, 1999.
[BFH75]
Y. M. M. Bishop, E. E. Feinberg, and P. W. Holland. Discrete Multivariate

Analysis, Theory and Practice. Cambridge Press, 1975.
[Bil89]
C. M. Bilbo. Statistisk analyse af relationer mellem alternative antistoftracere.

Masters thesis, Informatics and Mathematical Modeling, Technical University
of Denmark, Lyngby, 1989. In Danish.
[Bis95]
C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University

Press, 1995.
[BP00]
L. Bruzzone and D. F. Prieto. Automatic analysis of the difference image for

unsupervised change detection. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 11(4):11711182, 2000.
[CNS04]
M. J. Canty, A. A. Nielsen, and M. Schmidt. Automatic radiometric normalization of multitemporal satellite imagery. Remote Sensing of Environment,
91(3,4):441451, 2004.
[DH73]
R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley,

1973.
[Dun73]
J. C. Dunn. A fuzzy relative of the isodata process and its use in detecting
compact well-separated clusters. Journal of Cybernetics, PAM1-1:3257, 1973.
[Fra96]
C. Fraley. Algorithms for model-based gaussian hierarchical clustering. Technical

report 311, Department of Statistics, University of Washington, Seattle, 1996.
[GG89]
I. Gath and A. B. Geva. Unsupervised optimal fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intellegence, 3(3):773781, 1989.
[GW02]
R. C. Gonzalez and R. E. Woods. Digital Image Processing. Addison Wesley,

2002.
203
204
BIBLIOGRAPHY
[Hab95]
P. Haberacher. Praxis der Digitalen Bildverarbeitungen und Mustererkennung.

Carl Hanser Verlag, 1995.
[Hil01]
K. B. Hilger. Exploratory Analysis of Multivariate Data. PhD Thesis, IMMPHD-2001-89, Technical University of Denmark, 2001.
[HKP91]
J. Hertz, A. Krogh, and R. G. Palmer. Introduction to the Theory of Neural

Computation. Addison-Wesley, 1991.
[Hu62]
M. K. Hu. Visual pattern recognition by moment invariants. IRE Transactions

on Information Theory, IT-8:179187, 1962.
[JRR99]
X. Jia, J. A. Richards, and D. E. Ricken. Remote Sensing Digital Image Analysis.

Springer-Verlag, 1999.
[Koh89]
T. Kohonen. Self-Organization and Associative Memory. Springer-Verlag, 1989.
[KS79]
M. Kendall and A. Stuart. The Advanced Theory of Statistics, volume 2. Charles

Griffen & Company Limited, fourth edition, 1979.
[LMM95]
H. Li, B. S. Manjunath, and S. K. Mitra. A contour-based approach to multisensor image registration. IEEE Transactions on Image Processing, 4(3):320334,
1995.
[Mal89]
S. G. Mallat. A theory for mutiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence,
11(7):674693, 1989.
[Mil99]
A. S. Milman. Mathematical Principles of Remote Sensing. Sleeping Bear Press,

1999.
[Moe93]
M. F. Moeller. A scaled conjugate gradient algorithm for fast supervised learning.

Nearal Networks, 6:525533, 1993.
[NCS98]
A. A. Nielsen, K. Conradsen, and J. J. Simpson. Multivariate alteration detection (MAD) and MAF processing in multispectral, bitemporal image data:
New approaches to change detection studies. Remote Sensing of Environment,
64:119, 1998.
[Pal98]
G. Palubinskas. K-means clustering algorithm using the entropy. SPIE (European Symposium on Remote Sensing, Conference on Image and Signal Processing
for Remote Sensing), September, Barcelona, Vol 3500:6371, 1998.
[Pat77]
W. M. Patefield. On the information matrix in the linear functional problem.

Journal of the Royal Statistical Society, Series C, 26:6970, 1977.
[PFTV86] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical

Recipes. Cambridge University Press, 1986.
[RC96]
B. S. Reddy and B. N. Chatterji. An fft-based technique for translation, rotation

and scale-invariant image registration. IEEE Transactions on Image Processing,
5(8):12661271, 1996.
[RCSA03] D. Riano, E. Chuvieco, J. Salas, and I. Aguado. Assessment of different topographic corrections in Landsat-TM data for mapping vegetation types. IEEE
Transactions on Geoscience and Remote Sensing, 41(5):10561061, 2003.
BIBLIOGRAPHY
205
[Rip96]
B. D. Ripley. Pattern Recognition and Neural Networks. Cambridge University

Press, 1996.
[RW00]
T. Ranchin and L. Wald. Fusion of high spatial and spectral resolution images:
the ARSIS concept and its implementation. Photogrammetric Engineering and
Remote Sensing, 66(1):4961, 2000.
[Sie65]
S. S. Siegel. Nonparametric Statistics for the Behavioral Sciences. McRaw-Hill,

1965.
[Sin89]
A. Singh. Digital change detection techniques using remotely-sensed data. Int.

J, Remore Sensing, 10(6):9891002, 1989.
[SP90]
S. Shah and F. Palmieri. Meka - A fast, local algorithm for training feed forward
neural networks. Proceedings if the International Joint Conference on Neural
Networks, San Diego, I(3):4146, 1990.
[TGG82]
P. M. Teillet, B. Guindon, and D. G. Goodenough. On the slope-aspect correction

of multispectral scanner data. Canadian Journal of Remote Sensing, 8(2):84
106, 1982.
[TH01]
C. V. Tao and Y. Hu. A comprehensive study of the rational function model for
photogrammetric processing. Photogrammetric Engineering and Remote Sensing, 67(12):13471357, 2001.
[WB02]
Z. Wang and A. C. Bovik. A universal image quality index. IEEE Signal Processing Letters, 9(3):8184, 2002.
[Wie97]
R. Wiemker. An iterative spectral-spatial bayesian labelling approach for unsupervised robust change detection on remotely sensed multispectral imagery.
Proceedings of the 7th International Conference on Computer Analysis of Images and Patterns, Springer LCNS Vol 1296:263370, 1997.
[WK91]
S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn. Morgan

Kaufmann, 1991.
206
BIBLIOGRAPHY

Image Analysis and Pattern Recognition For Remote Sensing With Algorithms in ENVI/IDL

Uploaded by

Copyright:

Available Formats

Image Analysis and Pattern Recognition For Remote Sensing With Algorithms in ENVI/IDL

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Image Analysis and Pattern Recognition For Remote Sensing With Algorithms in ENVI/IDL

Uploaded by

Copyright:

Available Formats

Image Analysis and Pattern Recognition

for Remote Sensing

Morton John Canty

A simple cost function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

Algorithms that minimize the simple cost function . . . . . . . . . . . . . . . 81

Agglomerative hierarchical clustering . . . . . . . . . . . . . . . . . . . 83

Including spatial information . . . . . . . . . . . . . . . . . . . . . . . 87

The Kohonen Self Organizing Map . . . . . . . . . . . . . . . . . . . . . . . . 89

Unsupervised classification of changes . . . . . . . . . . . . . . . . . . . . . . 91

10.1 Bayes decision rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

11.1 Mixture modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

A.2 Recursive least squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

Inner product space . . . . . .

C Advanced Neural Network Training Algorithms

D.7 Neural network: Scaled conjugate gradient . . . . . . . . . . . . . . . . . . . . 198

Images, Arrays and Vectors

Multispectral satellite images

CHAPTER 1. IMAGES, ARRAYS AND VECTORS

1.1. MULTISPECTRAL SATELLITE IMAGES

or digital number g and at-sensor radiance f is determined by the sensor calibration as

whereas in band interleaved by line (BIL) it would be stored as

and in band sequential (BSQ) format it is stored as

CHAPTER 1. IMAGES, ARRAYS AND VECTORS

Algebra of vectors and matrices

It is very convenient to use a vector representation for multispectral images, namely

The transpose of the two-dimensional column vector shown in Fig. 1.2,

The length or norm of the vector x is

1.2. ALGEBRA OF VECTORS AND MATRICES

Figure 1.3: The inner product.

CHAPTER 1. IMAGES, ARRAYS AND VECTORS

a11 b11 + a12 b21

and is another matrix. The determinant of a two-dimensional matrix is

The identity matrix is given by

The matrix inverse A1 is defined in terms of the identity matrix according to

IDL> print, determ(float(a))

Eigenvalues and eigenvectors

1.3. EIGENVALUES AND EIGENVECTORS

This is the same as the two equations

a21 x1 + (a22 )x2 = 0.

is said to diagonalize the matrix a. That is

We can illustrate the whole procedure in IDL as follows:

CHAPTER 1. IMAGES, ARRAYS AND VECTORS

Finding minima and maxima

In order to maximize some desirable property of a multispectral image, such as signal to

and is defined as the vector

The scalar expression

where A is a matrix, is called a quadratic form. We have

1.4. FINDING MINIMA AND MAXIMA

Figure 1.4: A function of one variable.

see Fig. 1.4. Then f (x ) is a local minimum if

> 0. This becomes obvious if we

For |x x | sufficiently small this is equivalent to

The situation is similar for scalar functions of a vector:

where H is called the Hessian matrix:

In the neighborhood of the critical point, since

= 0, we get the approximation

f (x) f (x ) + (x x )> H(x x ).

or, in terms of our notation for vector derivatives,