Unit 4: Computer Vision Lecture Notes
Contents

1 Introduction
1.1 What is Vision? ... Computer Vision?
1.2 Types of Images
1.3 Application Areas
1.4 The Image Processing / Machine Vision Universe
1.5 Why is Computer Vision Difficult?
2.3 Color
2.3.1 Human Color Perception
2.3.2 Color Models
2.4 Digitization
2.4.1 Sampling
2.4.2 Quantization
2.4.3 Representation of Digital Images
2.5 Imaging Devices
3 Preprocessing
3.1 Normalization
3.1.1 Intensity
3.1.2 Histogram Equalization
3.1.3 Some Other Techniques
3.2 Filtering
4.3 Motion
4.3.1 Optical Flow
4.3.2 Estimation of Optical Flow
4.4 Depth
4.4.1 Stereo Vision
4.4.2 Estimation of Stereo-Correspondence via Optical Flow
5 Image Primitives
5.1 Fundamentals
5.2 Region Segmentation
5.2.1 Region Growing
5.2.2 Splitting
5.2.3 Split-and-Merge
5.3 Contour Extraction
5.3.1 Contour Following
5.3.2 Hough-Transform
5.4 Keypoints
5.4.1 The Scale-Invariant Feature Transform (SIFT)
6.3 Eigenimages
6.3.1 Formal Problem Statement
6.3.2 Computation of Eigenfaces
7 Tracking
7.1 Introduction
Chapter 1
Introduction
Vision: One of the 6 human senses (vision, audition [hearing], haptics [touch] + proprioception [the
kinesthetic sense], olfaction [smell], gustation [taste], equilibrioception [sense of balance])
For humans, the primary sense of perception (wrt. information density perceived: approx. 10M bits per second)
Allows perception of 3D scenes on the basis of 2D mappings (produced by the eyes, “images”)
of the scene
(Perception is the process of acquiring and interpreting sensory information.)
Computer Vision: Realization of visual perception capabilities (known from humans) within artifi-
cial systems (e.g. robots)
Example: Recognizing a soccer ball on the playground among other soccer-playing robots (e.g. Aibos)
Note: Here, problem is greatly simplified by e.g. giving playground and ball well-defined
colors and using (rather) controlled illumination.
Computer Vision vs Image Processing: Image processing deals with aspects of manipulating and
interpreting (digital) images in general
• Broader methodological basis
(includes: restoration, compression, enhancement, ...; image synthesis = computer graphics)
i.e. covers aspects not relevant from the perspective of (human) perception
• Focus mostly on lower level processing steps
(i.e. the more interpretation/reasoning required the more likely a method will be called
“computer vision” rather than “image processing”)
1.2 Types of Images
• Most widely used image type: “natural” images, i.e. resulting from a scene illuminated with visible light (≙ the type captured with standard cameras)
Slides: TU Dortmund (Fig. 1), Taipei traffic (Fig. 2), ...
• Infra-red images, i.e. resulting from radiation of hot surfaces (captured e.g. from satellites)
Slide: North America (IR) (Fig. 3)
• Multi-spectral images, i.e. taking into account other parts of the electromagnetic spectrum
(which visible light is part of)
Slide: LandSat image of Amazonas rain-forest region (Fig. 4)
• Depth images (generated by e.g. laser range finder or so called “time-of-flight” cameras)
Slide: Depth image of in-door scene (Fig. 8)
1.3 Application Areas
Computer Vision
• Face Recognition (i.e. detection and/or identification [e.g. for biometric access control])
Slide: Examples of face detection (Fig. 11, Fig. 12)
• Automatic reading of postal addresses (largely for machine-printed text; for handwritten addresses approx. 50% “finalization” in the US as of 2000 [cf. IWFHR 7])
• And not to forget: automatically guided weapons (e.g. cruise missiles use an image of the target, used for a correlation search in the final phase of flight)
1.4 The Image Processing / Machine Vision Universe
1.5 Why is Computer Vision Difficult?
1.6 Typical Architecture of a Machine Vision System
Chapter 2
Goal: Mapping/Projection of a 3-dimensional scene onto a 2-dimensional (digital) image (in the
memory of a machine vision system)
Processing Steps:
a) Visualization
by means of physical processes of ...
b) Image Formation
An imaging system (e.g. a camera, electromagnetic field) projects the radiation originating
from the 3-dimensional scene onto a 2-dimensional image plane
2.1.1 Visualization
Most “widely used” / “readily available” type of radiation for the visualization of objects / scenes:
Electromagnetic radiation.
Slide: Overview of the electromagnetic spectrum (Fig. 18). Note: Similar graphic also in [Gon02].
The most important part of the electromagnetic spectrum for human and computer vision is visible light with wavelengths ranging from approx. 400 to 800 nm.
⇒ Will consider only this type of radiation further!
Remarks:
• The el-mag. spectrum is continuous, i.e. arbitrary wave lengths can occur (ignoring quantum
effects!)
• Real radiation is in general not homogeneous but consists of a mixture of different wave lengths
• Intensity of the radiation (in a given range of wave lengths; for visible light: brightness)
• Composition of the total radiation from parts with different wavelengths (for visible light:
“color” [Beware: Color is not an objective but a subjective measure!])
Intensity
For so-called “grey level/scale images” only the intensity reflected from an object / a scene is measured.
The observed (light) intensity results from:
• the spatial configuration (distance, orientation) of illumination sources and (reflective) object
surfaces
Surface reflection is composed of
• the absorption of incident light by object surfaces (almost complete for visible light: black
surfaces)
Note: Observed intensity results from interplay of all factors (Is in general dependent on illumination
sources and complete scene)
⇒ Inverse problem in computer graphics: Generating realistic illumination for scenes.
Spectral Composition
• Spectral composition of illumination (here: from el-mag. spectrum, especially light) is primar-
ily dependent on
• The (varying) spectral composition of visible light (i.e. the physical quantity) produces the
sensation of color (i.e. subjective) in the human visual system
Note: There is no one-to-one mapping between spectral composition of light and perceived color!
In image formation an imaging system (mostly: an optical system) projects an image of a 3-dimensional
scene or object onto a 2-dim. image plane.
A Simple Imaging Device
Experiment: Take a box, prick a small hole into one of its sides (with a pin), and replace the side opposite to the hole with a translucent plate.
Hold the box in front of you in a dimly lit room, with the pinhole facing a candle (i.e. a
light source) ...
What will you see? — An inverted image of the candle
⇒ Camera Obscura (invented in 16th century)
Idealized optical imaging system / simplest imaging system imaginable (idealization of camera ob-
scura)
Principle: Infinitely small pinhole ensures that exactly one light ray originating from a point in the
scene falls onto a corresponding point in the image plane.
I.e. exactly one light ray passes through each point in the image plane, the pinhole, and some scene point.
Note: Despite its simplicity the pinhole model often provides an acceptable approximation of the
imaging process!
• Perspective projection creates inverted images ⇒ More convenient to consider virtual image
on plane in front of the pinhole.
• Obvious effect of perspective projection: Apparent size of objects is dependent on their dis-
tance.
Figure: Pinhole camera geometry (after [For03, Fig. 1.4, p. 6])
• Coordinate System (O, i, j, k) attached to the pinhole of the camera (Origin O coincides with
pinhole)
• The line perpendicular to the image plane and passing through O is called the optical axis; the point C′ is the image center
Let P denote some scene point with coordinates (x, y, z) and P′ its image at (x′, y′, z′).

As P′ lies in the image plane: z′ = f′

As P, O, and P′ are collinear: OP′ = λ · OP for some λ

Consequently, we obtain the following relations between the coordinates of scene and image points:

(x′, y′, z′) = λ (x, y, z)  ⇔  λ = x′/x = y′/y = z′/z = f′/z

and finally:

x′ = f′ · x/z   and   y′ = f′ · y/z

Note: The perspective projection model can be further simplified by assuming that the scene depth is small with respect to the scene distance ⇒ scene points have approximately identical distance z = z0 ⇒ constant magnification m = −f′/z0 (weak perspective projection)
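To make the projection equations concrete, the following short Python sketch (an illustration added to these notes, assuming NumPy and camera-frame coordinates with z > 0; the sign of the weak-perspective magnification depends on the axis convention) applies x′ = f′·x/z and y′ = f′·y/z to a set of scene points and contrasts it with the weak-perspective approximation:

import numpy as np

def perspective_projection(points, f_prime):
    """Project 3D points (x, y, z) given in the camera frame onto the image
    plane at distance f_prime, using x' = f'*x/z and y' = f'*y/z."""
    points = np.asarray(points, dtype=float)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    return np.stack([f_prime * x / z, f_prime * y / z], axis=1)

def weak_perspective_projection(points, f_prime, z0):
    """Weak perspective: constant magnification m = f'/z0 for all points
    (the notes write m = -f'/z0; the sign depends on the axis convention)."""
    m = f_prime / z0
    return m * np.asarray(points, dtype=float)[:, :2]

# Example: two scene points at different depths appear with different sizes.
P = [(1.0, 0.5, 2.0), (1.0, 0.5, 4.0)]
print(perspective_projection(P, f_prime=0.05))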
Cameras With Lenses
Disadvantage of pinhole principle: Not enough light gathered from the scene (only single ray per
image point!) ⇒ Use lens to gather more light from scene and keep image in focus
Note: As real pinholes have finite size the image plane is illuminated by a cone of light rays: The
larger the hole, the wider the cone ⇒ image gets more and more blurred.
Behaviour of lenses defined mainly by geometric optics (ignoring physical effects of e.g. interference,
diffraction, etc.):
• In homogeneous media light (rays) travels in straight lines.

• When a light ray is reflected from a surface (specular reflection), the incident ray, the reflected ray, and the surface/interface normal are coplanar; the angles between the normal and the two rays are equal.

• When a light ray passes from a medium with refractive index n1 into a medium with refractive index n2, it is refracted at the interface according to Snell's law:

  n1 sin(α1) = n2 sin(α2)
⇒ Consider refraction and ignore reflection (i.e. won’t consider optical systems which include
mirrors as, e.g., telescopes)
Assumptions:
• Angles between light rays passing through a lens and the interface normal (normal to the re-
fractive surface) are small.
α1 = γ + β1,   α2 = γ − β2
• As angles are approximately equal to their sines or tangents:

  sin γ · R = h  ⇒  γ ≈ h/R
  tan βi · di = h  ⇒  βi ≈ h/di   (for i = 1, 2)
Note: Relationship between d1 and d2 depends only on R and the indexes of refraction n1 and n2 but
not on βi .
• The lens is thin, i.e. a ray entering it is refracted at the front interface and immediately again at the back interface.
• The relations between object and image distance (−z and z′), size (y and y′), and the focal length f of the lens are defined by the thin lens equation:

  1/z′ − 1/z = 1/f   where   f = R / (2(n − 1))

  also:  y/y′ = z/z′
• If z approaches infinity, image points are in focus at distance f .
Note: In practice objects within some limited range of distances (the so-called depth of focus) are in
acceptable focus.
Outside the focal plane images of scene points are projected onto a circle of finite size (i.e.
blurred).
• Spherical aberration:
Light rays from P form a circle of confusion in the image plane. (Circle with minimum diam-
eter = circle of least confusion, in general not located in P ′ )
• Chromatic aberration, resulting from the index of refraction being dependent on the wavelength. Can be shown using a prism.
• However, multi-lens systems suffer from vignetting effects as light rays are blocked by apertures of different lenses.
2.2 Structure of the Human Eye cf. [For03, Sec. 1.3], [Gon02, Sec. 2.1]
In general, the human eye is an optical imaging system (i.e. a “camera”), which projects a scaled
down, inverted image of the scene onto the background of the eye (= retina).
Slide: The human eye as an imaging system (Fig. 21)
• The iris realizes the aperture; it can be adjusted in size by the ciliary muscle (synchronously for both eyes).
• The lens is flexible and able to adjust its refractive power via the tension of the ciliary fibers.
• Muscles rotate the eye such that objects of interest are projected onto the fovea.
• The blind spot, i.e. the path of the optical nerve, contains no photoreceptors (Note: Nerve
connections of photoreceptors lie before the retina).
2.3 Color

2.3.1 Human Color Perception

... is based on 3 types of color-sensitive photoreceptors in the retina (cones), which respond differently to the wavelengths of the visible spectrum.
• Excitation of receptors is maximum for “red”, “green”, or “blue” light, respectively. However,
cones respond to complete visible spectrum!
• The intensity of the response (≙ sensitivity of the receptors) is highest for “green” light and decreases towards the boundaries of the visible spectrum.
• Homogeneous light of a single wavelength creates “pure” color sensation, which, however, can
also be produced by combination of appropriate wavelengths.
Color perception of the human eye can be represented by the CIE-diagram based on the following
definitions:
Let X, Y , Z be the absolute responses of the color sensors
Let the total intensity be: X + Y + Z = I
The relative color portions are then given by:

x = X/I = X/(X + Y + Z),   y = Y/I,   z = Z/I

⇒ x + y + z = 1, or z = 1 − x − y (i.e. only 2 independent quantities)
⇒ colors can be represented in the x/y-plane
Slide: Chromaticity diagram (Fig. 25). Note: Similar graphic also in [Gon02, Chap. 6].
Remarks:
• Pure spectral colors are found on the perimeter of the “tongue-shaped” color space. Those
colors have maximum saturation.
• Mixing two color components can potentially create all colors lying on the straight line between the two points in the diagram (mixing of 3 colors ⇒ triangle).
2.3.2 Color Models

... motivated by the primary sensitivity of the human color receptors for (roughly) red, green, and blue light ⇒ RGB
Required: Specification of “base colors” for red, green, and blue (different possibilities!)
For maximisation of the colors that can be represented ([Bal82, p. 32]):

λ1 = 410 nm (blue)
λ2 = 530 nm (green)
λ3 = 650 nm (red)
By superpositions of those wavelengths most but not all colors can be generated (≙ triangle in the CIE diagram).
Note: Mixing of colors (i.e. superposition) in “wave length space” is additive as opposed to mixing
of color pigments for printing (subtractive color mixing).
When normalizing the maximum intensity in the color channels to 1 (minimum is 0, as no “negative”
frequency components possible) one obtains the following RGB color cube:
H = Hue (“color tone”), roughly proportional to the average wavelength of a mixture of primary colors
S = Saturation (“purity” of the color)
I = Intensity
The HSI model defines HS color planes (either of circular or hexagonal shape) perpendicular to the
intensity axis of the model ⇒ H, S, I are usually given in cylindrical coordinates.
Slide: HSI color space (Fig. 26)
... used for PAL television and, slightly modified (e.g. YCbCr with Cb, Cr being scaled versions of U, V), in computer component video.
YUV can be computed from (normalized) RGB as follows:
Remarks:
• U and V are (appropriately scaled) differences to the original B (“blue”) and R (“red”) components.
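As a hedged illustration of the RGB → YUV conversion (the exact matrix used in the notes was not preserved here; the sketch below uses the commonly quoted analog-PAL coefficients Y = 0.299 R + 0.587 G + 0.114 B, U = 0.492 (B − Y), V = 0.877 (R − Y)):

import numpy as np

# Commonly used (analog PAL) RGB -> YUV coefficients; assumed here, since the
# original conversion matrix was lost from the notes.
RGB_TO_YUV = np.array([
    [ 0.299,  0.587,  0.114],   # Y (luminance)
    [-0.147, -0.289,  0.436],   # U = 0.492 * (B - Y)
    [ 0.615, -0.515, -0.100],   # V = 0.877 * (R - Y)
])

def rgb_to_yuv(rgb):
    """Convert normalized RGB values (in [0, 1]) to YUV."""
    return np.asarray(rgb, dtype=float) @ RGB_TO_YUV.T

print(rgb_to_yuv([1.0, 1.0, 1.0]))   # white -> Y = 1, U = V = 0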
... is used for producing colors in print, i.e. by using color pigments and not illumination sources ⇒
“subtractive” color mixing
Color pigments absorb certain spectral components of the incident light and reflect only some remaining wavelength components (e.g. a “red” pigment absorbs the “green” and “blue” components and reflects only the “red” part of the spectrum).
Primary colors for “subtractive” color mixing (better: mixing of coloring substances [i.e. pigments]): CMY model
C = cyan ≙ W − R
M = magenta (≈ purple) ≙ W − G
Y = yellow ≙ W − B
Primary colors “subtract” one of the primary colors of the additive RGB model from white light (W ).
Mixtures can, therefore, be calculated with respect to the RGB model:
C ⊕ M = (W − R) + (W − G)
      = W + W − R − G      (with W + W ≙ W)
      = W − R − G
      = (R + G + B) − R − G = B
Note: For the purpose of producing printed documents the CMY model is often realized as CMYK
with an additional K for black pigment, as C ⊕ M ⊕ Y usually does not produce acceptable
black for e.g. typesetting text.
2.4 Digitization
... comprises two processing steps, namely sampling of an image (in general: some analog signal) in the spatial (or time) domain and quantization of the samples (≙ measurements).
2.4.1 Sampling
(for images) ... is the measurement of the “content” of an analog image (e.g. the intensity) at the
discrete points of a 2-dimensional grid.
The topology of the grid is defined by the sensor arrangements of the imaging device used (Sect. 2.5).
Theoretical Result: The so-called sampling theorem states that sampling a continuous signal can
be achieved without loss of information, if sampling frequency is at least twice as high as the
highest frequency component present in the signal.
2.4.2 Quantization
... is the mapping of analog/continuous samples/measurements onto a discrete (especially finite) set
of values.
Q: [fmin, fmax] → {b0, b1, b2, ..., bL−1}
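A minimal sketch of such a quantization characteristic (uniform, i.e. linear; function and parameter names are illustrative):

import numpy as np

def quantize(f, f_min, f_max, bits=8):
    """Uniformly quantize continuous samples from [f_min, f_max] onto
    L = 2**bits discrete levels b_0 ... b_{L-1} (here: the integers 0 ... L-1)."""
    L = 2 ** bits
    f = np.clip(np.asarray(f, dtype=float), f_min, f_max)
    idx = np.floor((f - f_min) / (f_max - f_min) * L).astype(int)
    return np.clip(idx, 0, L - 1)

print(quantize([0.0, 0.5, 1.0], 0.0, 1.0, bits=2))   # -> [0 2 3]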
Quantization of continuous samples (after [Nie03, p. 68])
Note: Quantization necessarily introduces errors into the digitization process ⇒ exact reconstruction
of original signal no longer possible.
Remarks:
Note: The characteristic curve of the quantization need not be linear (nonlinear characteristic
useful for, e.g., quantization of X-ray images due to nonlinearity of absorption by tissue).
– For grey level images B = 8 bits (i.e. L = 2^8 = 256 discrete intensities) yield good subjective reproduction of the image.
– For color images B = 8 bits per color plane (R, G, B).
A grid point (i.e. measurement point) in a digital image is called pixel (or pel = picture element). Its
position in the M × N image array is specified by the row and column index x = 0, 1, ...M − 1 and
y = 0, 1, 2, ...N − 1 (Beware: Indices do not correspond to actual spatial positions!).
Note: When representing digital images mostly the so-called upper-left coordinate convention is
used, i.e. the origin of the image matrix lies at the upper left corner of the image.
Note: The upper-left convention does not define a right-handed coordinate system ⇒ computing angles is affected!
For segmentation (e.g. separating [foreground] objects from background) the definition of pixel
neighborhood on the image matrix/grid is important. Definition of connectivity is affected by this
choice.
Slide: Different definitions of pixel neighborhoods and effect on connectivity (Fig. 29)
Note: The problem with neighborhood variants and resulting connectivity can be resolved using a hexagonal image/sampling grid (found in some digital cameras).
2.5 Imaging Devices

Most widely used type of digital camera: CCD camera (CCD = charge coupled device).
• A CCD sensor uses a rectangular grid of photosensitive elements (of some finite size).
• Charges of individual sensor elements are “read out” – i.e. moved out of the sensor array – by
using charge coupling (a row at a time).
• In order to build up charge in sensor elements some time of exposure to incident light has to be
allowed.
• E.g. for video applications image contents are read at some fixed rate (25 Hz for PAL, actually
reading odd and even rows separately at a rate of 50 Hz ⇒ motion can cause distortions).
• Capturing “color” requires sensors to be made sensitive to different parts of the electromagnetic spectrum:
  – Using color filters (arranged in a mosaic of alternating G/R and B/G rows) in front of a single CCD chip and reading the R, G, B images sequentially
  – Using a beam splitter and color filters with 3 different CCD chips
Chapter 3
Preprocessing
Goal: “Preparation” of images such that better results are achieved in subsequent processing steps
(e.g. segmentation).
Preprocessed images are better suited for future processing.
⇒ image enhancement
Note: There is no such thing as the preprocessing operation! (Techniques are highly application dependent.)
No in-depth treatment of preprocessing techniques! Only the principal ideas and typical example methods!
3.1 Normalization
Problem: Images/objects in images usually have parameters that vary within certain intervals (e.g.
size, position, intensity, ...). However, results of image analysis should be independent of this
variation.
Goal: Transform images such that parameters are mapped onto normalized values (or some appro-
priate approximation).
3.1.1 Intensity
• Normalization to a standard interval [0, a], e.g. [0, 255]:
  Transform the original grey value fij into the normalized value hij according to:

  hij = a (fij − fmin) / (fmax − fmin)   where fmax = max_ij fij, fmin analogously

• Normalization to zero mean and unit variance:

  μ = (1 / Σij 1) Σij fij,   σ² = (1 / Σij 1) Σij (fij − μ)²

  hij = (fij − μ) / σ
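Both normalization variants are easily expressed in code; a minimal NumPy sketch (illustrative only):

import numpy as np

def normalize_interval(f, a=255.0):
    """Map grey values linearly onto the standard interval [0, a]."""
    f = np.asarray(f, dtype=float)
    return a * (f - f.min()) / (f.max() - f.min())

def normalize_mean_var(f):
    """Map grey values to zero mean and unit variance."""
    f = np.asarray(f, dtype=float)
    return (f - f.mean()) / f.std()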
Note: Global normalization of intensity usually simpler but less effective than more complicated
local normalization!
3.1.2 Histogram Equalization

... the most “popular” normalization method, based on the grey level histogram of an image given by:
h(g) = # of grey values g in the image = Σij δ(fij − g)   where δ(x) = 1 if x = 0, and 0 otherwise
From the histogram the (estimated) probability of a pixel having a certain grey level g can be derived:
p(g) = h(g) / Σij 1 = h(g) / (MN)   for M × N images
Goal: Use complete dynamic range of available grey levels within an image in order to resolve small
but frequent differences better.
Idea: Grey level intervals of high density in the histogram should be “stretched” and those with low
density should be “compressed”.
⇒ In general for the discrete case the number of grey levels is reduced by “compression”.
Method: Let the cumulative distribution function of the grey values be:

H(f) = Σ_{g=0}^{f} p(g) = (1 / MN) Σ_{g=0}^{f} h(g)

Then the transform

T(f) = ⌊(L − 1) · H(f)⌋

(where L is the number of grey levels available and ⌊x⌋ represents the largest integral number that is equal to or less than x)
achieves an approximate equalization of the grey level histograms, i.e. the histogram of the trans-
formed grey levels is approximately equal to a uniform distribution.
Note: In the discrete case the equalized histogram will in general not be equal to a uniform distribution. This can be achieved for the continuous case only!
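A compact sketch of the discrete procedure (assuming an integer-valued grey-level image and NumPy):

import numpy as np

def histogram_equalization(image, L=256):
    """Histogram equalization via T(f) = floor((L-1) * H(f)), where H is the
    cumulative grey-level distribution estimated from the image itself."""
    img = np.asarray(image)
    hist = np.bincount(img.ravel(), minlength=L)      # h(g)
    H = np.cumsum(hist) / img.size                    # cumulative distribution
    lut = np.floor((L - 1) * H).astype(img.dtype)     # per-grey-level transform
    return lut[img]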
3.2 Filtering
Image filtering comprises image transforms that work on a certain neighborhood of pixels (usually a rectangular or square area centered at the pixel in question) for deriving a new grey value.
Simple formalization of filtering:
• Move a window (which has the size of the neighborhood considered) from point to point in the
image
• Calculate the response of the filter at each point by applying some operation to the pixel grey
values within the window.
⇒ Filters are implemented using local image operations, i.e. they realize locally defined image transforms!
3.2.1 Linear Filters and Convolution
An especially relevant class of filters are linear (the transform satisfies T{af + bg} = aT{f} + bT{g}), shift-invariant (the transform is independent of the pixel position) filters.
Linear filters can be realized as follows:
• Define a filter mask with the size of the neighborhood and filter coefficients/weights w(s, t)
assigned to each point of the mask.
Slide: Principle of filtering using masks (Fig. 35)
• Calculate the filter response by a weighted sum of pixel grey values and mask coefficients.
For a 3 × 3 mask the filter response is given by:
g(x, y) = w(−1, −1) f(x−1, y−1) + w(−1, 0) f(x−1, y) + ... + w(0, 0) f(x, y) + ... + w(1, 1) f(x+1, y+1)
(Note: Coefficient w(0, 0) coincides with f (x, y), i.e. mask is centered at position (x, y))
In general the result of applying a linear filter w() of size m × n (with m = 2a + 1 and n = 2b + 1)
to an image f () of N × M pixels is given by:
g(x, y) = Σ_{s=−a}^{a} Σ_{t=−b}^{b} w(s, t) f(x + s, y + t)
Note: This formulation is equivalent to computing the cross-correlation between the mask and the
image, which is similar to the concept of convolution:
h = f ∗ g  ⇔  h(x, y) = Σ_s Σ_t f(s, t) g(x − s, y − t) = Σ_s Σ_t f(x − s, y − t) g(s, t)
For convolution either the signal f() or the mask w() needs to be mirrored along the x- and y-axis, which is unproblematic as masks are frequently symmetric.
• Image Averaging
  Use an m × n window with weights w(s, t) = 1/(mn)
• Gaussian Smoothing
Required: Filter coefficients for discrete 2D approximation of Gaussian
A discrete 1D Gaussian can be obtained from the rows of Pascal's Triangle:

1
1 1
1 2 1
1 3 3 1
1 4 6 4 1

A 3 × 3 Gaussian filter mask is obtained from the outer product

(1 2 1)ᵀ · (1 2 1) =
1 2 1
2 4 2
1 2 1
Note: In order not to amplify the image content, masks are usually normalized by the sum of their coefficients (for the 3 × 3 Gaussian: 1/16).
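A direct (unoptimized) implementation of mask-based filtering as described above, together with the normalized 3 × 3 Gaussian mask (zero padding at the borders is an assumption of this sketch):

import numpy as np

def filter2d(image, mask):
    """Apply a linear filter mask (cross-correlation form, as in the notes);
    border pixels are handled by zero padding."""
    img = np.asarray(image, dtype=float)
    w = np.asarray(mask, dtype=float)
    a, b = w.shape[0] // 2, w.shape[1] // 2
    padded = np.pad(img, ((a, a), (b, b)), mode='constant')
    out = np.zeros_like(img)
    for x in range(img.shape[0]):
        for y in range(img.shape[1]):
            region = padded[x:x + w.shape[0], y:y + w.shape[1]]
            out[x, y] = np.sum(w * region)
    return out

gauss_3x3 = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 16.0   # normalized mask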
Motivations:
• Better understanding of effects of filtering on images (only with more mathematical details!)
• Computes a frequency representation of a signal/image (Note: Both f and its transform F are
complex!)
• Input signals are roughly approximated by sine and cosine functions of different frequencies and amplitudes (Note: e^{ix} = cos x + i sin x).
• Assumes discrete periodic input ⇒ finite images are treated as if repeated periodically!
Note: The example shows that the averaging filter (rectangular spatial response) is not a good smoothing operation, as its effects are infinite in the frequency domain!
⇒ Better solution: the Gaussian, as it is form-invariant under the Fourier transform!
Important property of the Fourier transform:
Convolution in the spatial domain is converted to multiplication in the frequency domain and vice
versa:
f = g∗h⇔F = G·H
⇒ Convolution operations can be computed more efficiently in frequency domain! (Note: For fil-
tering additionally computation of forward and backward transform necessary ⇒ beneficial only for
large filter masks!)
... in general non-linear spatial filters whose response is based on an ordering (ranking) of intensity
values within the neighborhood considered.
Given a neighborhood V (x, y) = {f (x + s, y + t)| − a ≤ s ≤ a ∧ −b ≤ t ≤ b}
Order the pixel values (i.e. intensities) of the neighborhood into a ranked sequence R(x, y) = (r1, r2, ..., rK) with r1 ≤ r2 ≤ ... ≤ rK; the filter response is then some function of this ranking:

h(x, y) = φ{R(x, y)}
Simple Examples:
For the image patch

0 2 4 6
1 4 5 6
1 3 6 5

the 3 × 3 neighborhoods V(x, y) (first three columns) and V(x + 1, y) (last three columns) yield

R(x, y) = {0, 1, 1, 2, 3, 4, 4, 5, 6}   (9 values)  ⇒  r_(K+1)/2 = 3
R(x + 1, y) = {2, 3, 4, 4, 5, 5, 6, 6, 6}

For the neighborhood

V(x, y) =
0 0 1
0 1 1
0 1 1

R(x, y) = {0, 0, 0, 0, 1, 1, 1, 1, 1}   (9 values)  ⇒  r_(K+1)/2 = 1
The following “well known” rank-order / order-statistics filters/operations can be defined:
h(x, y) = r1                Erosion
h(x, y) = rK                Dilation
h(x, y) = r_(K+1)/2         Median
h(x, y) = rK − r1           Edge detection (aka “morphological edge”)
h(x, y) = r1 if f(x, y) − r1 < rK − f(x, y), else rK        Edge sharpening
Remarks:
• The median as a smoothing operation preserves contrast/edges (in contrast to e.g. averaging) but removes “salt-and-pepper” noise (i.e. small “errors”).
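A simple sketch of rank-order filtering over a square neighborhood (border handling by edge replication is an assumption of this illustration):

import numpy as np

def rank_filter(image, size=3, rank='median'):
    """Rank-order filtering on a size x size neighborhood:
    rank 0 = erosion (minimum), rank K-1 = dilation (maximum),
    'median' = median filter."""
    img = np.asarray(image, dtype=float)
    a = size // 2
    padded = np.pad(img, a, mode='edge')
    out = np.zeros_like(img)
    K = size * size
    r = K // 2 if rank == 'median' else rank
    for x in range(img.shape[0]):
        for y in range(img.shape[1]):
            window = np.sort(padded[x:x + size, y:y + size].ravel())
            out[x, y] = window[r]
    return out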
Slides: Example images for averaging, median (Fig. 40), erosion/dilation, morphological edge
(Fig. 41), and opening/closing (Fig. 42)
Chapter 4
... features (numerical values!) of individual pixels or pixel neighborhoods that are relevant for image
segmentation and/or interpretation.
For segmentation boundaries between regions of images are especially relevant, which become man-
ifest by discontinuities in the image.
Local discontinuities ≙ (usually) differences in image intensity/grey level
Remarks:
• Definition of “difference” more complicated for color images (i.e. multi-channel images)
Note: Usually pixels where relevant grey level differences can be observed are called edge pixels or edge elements (edgels), as opposed to contours, i.e. boundaries of regions.
Discrete Differentiation
Slides: Examples of edge types (ideal, ramp) (Fig. 43), and behaviour of derivatives (Fig. 44)
⇒ Computing local differences (by some method) for every pixel yields a new image, i.e. an edge image!
2 commonly used methods:
a) Gradient
For 2D signals (e.g. images) f(x, y) the gradient

g(x, y) = ( ∂f(x, y)/∂x , ∂f(x, y)/∂y )ᵀ = (fx, fy)ᵀ

represents the direction and the magnitude of a change in image intensity at position (x, y).

The magnitude and direction of the gradient can be computed as:

g_mag = |g(x, y)| = sqrt(fx² + fy²)
g_dir = arctan(fy / fx)
Note: This assumes a Cartesian coordinate system; it needs to be adapted for upper-left image coordinates!
For discrete signals (e.g. images) partial derivatives fx , fy need to be approximated by local
differences ∆i f, ∆j f . For computing those the following possibilities exist:
Δi f = f(i, j) − f(i − 1, j)          backward gradient
Δi f = f(i + 1, j) − f(i, j)          forward gradient
Δi f = f(i + 1, j) − f(i − 1, j)      symmetric gradient

(Δj f analogously in the j-direction)
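A minimal sketch computing gradient magnitude and direction with symmetric differences (NumPy; the axis conventions and border handling are assumptions of the illustration):

import numpy as np

def gradient(image):
    """Symmetric (central) differences as approximation of the partial
    derivatives; returns gradient magnitude and direction per pixel."""
    f = np.asarray(image, dtype=float)
    fi = np.zeros_like(f)
    fj = np.zeros_like(f)
    fi[1:-1, :] = f[2:, :] - f[:-2, :]      # difference along i (rows)
    fj[:, 1:-1] = f[:, 2:] - f[:, :-2]      # difference along j (columns)
    magnitude = np.sqrt(fi ** 2 + fj ** 2)
    direction = np.arctan2(fj, fi)          # arctan2 avoids division by zero
    return magnitude, direction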
b) 2nd Derivative
... not grey-level difference but change in grey-level curvature is interpreted as an edge.
Potential advantage: “wide” edges can be suppressed if only fast transition from positive to
negative curvature is considered as an edge.
Usually realized via the Laplacian operator:

∇²f(x, y) = ∂²f/∂x² + ∂²f/∂y²

which can be approximated in the discrete case by

∇²f = Δ²ii f + Δ²jj f = 4 f(i, j) − ( f(i + 1, j) + f(i − 1, j) + f(i, j + 1) + f(i, j − 1) )
Note: Depending on the definition used, the sign of the Laplacian may be inverted (as in the definition above, which follows the notation used in [Gon02])!
Problem: Applied in isolation both the gradient and the Laplacian are very sensitive to noise in
images!
Slide: Example of ramp edge corrupted by Gaussian noise with increasing variance showing
effect in 1st and 2nd derivatives (Fig. 45)
Remarks:
– Orientation of filter masks depends on orientation of coordinate axes (here: upper left!).
– Variants exist for detecting diagonal edges.
Slide: Example of gradient in x- and y-direction and combined magnitude (Fig. 46)
• Laplacian operator:
Remarks:
Note: As gradient/Laplacian are very sensitive to noise usually operators combining edge detection
and smoothing are applied.
• Sobel Operator
Slide: Example of Sobel operator, Laplacian and Laplacian smoothed with a Gaussian (LoG) (Fig. 48)
[For03, p. 189]
Problem: Appearance of surfaces or boundaries between object surfaces can not be described by
characteristics of a single pixel (e.g. grey-level or color)!
⇒ (local) neighborhood of pixels needs to be considered
Problem: How can local characteristics of textured surfaces be represented?
⇒ Principle approaches:
Goal: (Statistical) Description of relations between pixels’ grey values in a local neighborhood.
Assumption: Texture can be characterized by frequencies, with which two pixels with given grey
value occur in some given (relative) distance and with given orientation.
n_{i,j}(d, α) = number of pixel pairs at distance d and orientation α with grey value f(x, y) = i at the first pixel and j at the second
Remarks:
• If the total number of grey levels in an image is L one obtains L × L co-occurrence counts n_{i,j}(d, α).
• For counting grey level co-occurrences a local neighborhood of given size must be specified (How to treat image boundaries?).
Definition: The normalized grey-level co-occurrence matrix (GLCM) G(d, α) is defined as:

G(d, α) = [g_{i,j}(d, α)]   with   g_{i,j}(d, α) = n_{i,j}(d, α) / Σ_{i=0}^{L−1} Σ_{j=0}^{L−1} n_{i,j}(d, α)
Note: A single GLCM is not sufficient for describing a texture (only a single distance and orientation
parameter is used)!
⇒ Matrices need to be computed for several different distance and orientation parameters!
Example: Consider the following 4 × 4 image with grey levels 0 ... 3:

0 0 3 3
0 0 3 3
2 2 1 1
2 2 1 1

For pixel pairs at distance d = 1 and orientation α = 0° there are 12 pairs in total; the non-zero counts are n_{0,0} = n_{0,3} = n_{1,1} = n_{2,1} = n_{2,2} = n_{3,3} = 2, so that (rows i = 0 ... 3, columns j = 0 ... 3):

G(1, 0°) = 1/12 ·
2 0 0 2
0 2 0 0
0 2 2 0
0 0 0 2

For pairs at distance d = 2 and orientation α = 90° there are 8 pairs; the non-zero counts are n_{1,3} = n_{2,0} = 4:

G(2, 90°) = 1/8 ·
0 0 0 0
0 0 0 4
4 0 0 0
0 0 0 0

(boundary treatment: truncation)
Problem: GLCMs considerably increase the parametric representation of textures (multiple matrices for different d and α are required)!
⇒ Calculate features derived from the GLCMs
a) Energy

e(d, α) = Σ_{i=0}^{L−1} Σ_{j=0}^{L−1} g_{i,j}(d, α)²

– in regular textures only few strong/prominent grey-level differences occur ⇒ only few large g_{i,j}(d, α) ⇒ summation of squared values large
– otherwise many/all grey-level differences exist ⇒ many small g_{i,j}(d, α) ⇒ summation of squared values small
b) Contrast

c(d, α) = Σ_{i=0}^{L−1} Σ_{j=0}^{L−1} (i − j)² g_{i,j}(d, α)

– large differences are weighted strongly: (i − j)²
c) Homogeneity

h(d, α) = Σ_{i=0}^{L−1} Σ_{j=0}^{L−1} g_{i,j}(d, α) / (1 + |i − j|)
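The GLCM and the three derived features can be computed directly from their definitions; a small illustrative sketch (the displacement is given as (di, dj), boundary treatment by truncation):

import numpy as np

def glcm(image, di, dj, levels):
    """Grey-level co-occurrence matrix for displacement (di, dj); pairs whose
    second pixel falls outside the image are truncated (ignored)."""
    img = np.asarray(image)
    n = np.zeros((levels, levels), dtype=float)
    rows, cols = img.shape
    for x in range(rows):
        for y in range(cols):
            xx, yy = x + di, y + dj
            if 0 <= xx < rows and 0 <= yy < cols:
                n[img[x, y], img[xx, yy]] += 1
    return n / n.sum()

def glcm_features(g):
    """Energy, contrast and homogeneity derived from a normalized GLCM g."""
    i, j = np.indices(g.shape)
    energy = np.sum(g ** 2)
    contrast = np.sum((i - j) ** 2 * g)
    homogeneity = np.sum(g / (1.0 + np.abs(i - j)))
    return energy, contrast, homogeneity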
Basis: The (Fourier) spectrum decomposes an image into periodic functions (sines and cosines) of different wavelength and amplitude.
Idea: Capture periodic grey-level patterns within textures by frequency-based representations.
Note: Similarly to GLCMs not the raw spectrum but features derived from it are used for represent-
ing/characterizing textures!
b) Location of prominent peaks in the frequency plane ⇒ represents fundamental (spatial) period
of texture patterns
Extracting such features can be simplified by expressing the spectrum S(u, v) in polar coordinates as
a function S(r, θ).
For each radius r or angle θ the spectrum may be considered as a 1D function Sr (θ) or Sθ (r), respec-
tively.
More global representations can be obtained by integration (actually summation in the discrete case) of those functions:

S(r) = Σ_{θ=0}^{π} S_θ(r)

and

S(θ) = Σ_{r=1}^{R0} S_r(θ)
(where R0 is the maximal radius of a circle centered at the origin; because of symmetry only angles between 0 and π need to be considered)
Slide: Periodic textures and derived frequency-based representations (Fig. 50)
4.3 Motion

When an object within a scene moves relative to the position of the camera (or when the camera moves relative to the scene), the 2-dimensional projection onto the image plane f(x, y) is additionally dependent on the time t, yielding f(x, y, t).
Motion in the image plane (!) can be estimated by finding corresponding pixels in subsequent images (≙ displacement wrt. the x/y-direction).
Note: Motion in 3D can only be estimated if, additionally, depth information is available!
Potential Methods:
Idea: Infer information about motion within a scene from changes in grey-level structure.
• ... the orientation of surfaces relative to illumination sources is invariant (or the illumina-
tion is uniform) ...
• ... and the orientation relative to the observer is constant (or the grey-level is independent
of a change in orientation).
Figure: Image at time t and at time t + dt: a point at position (x, y) with grey value f(x, y, t) is displaced by (dx, dy), i.e.

f(x, y, t) = f(x + dx, y + dy, t + dt)
An expansion of the expression on the right-hand side into a Taylor series at the point (x, y) yields:

f(x, y, t) = f(x, y, t) + (∂f/∂x) dx + (∂f/∂y) dy + (∂f/∂t) dt + {residual}

When ignoring terms of higher order within the Taylor expansion (i.e. the “residual”) one obtains (with fx = ∂f/∂x etc.):

fx dx + fy dy + ft dt = 0

Dividing by dt and writing u = dx/dt, v = dy/dt for the flow velocities yields the motion constraint equation:

Em = fx u + fy v + ft = 0
Problem: Optical flow (i.e. velocities u and v) is not uniquely defined by the motion constraint
equation Em .
However, assuming that every point in the image could move independently is not realistic.
Constraint: Usually opaque objects of finite size undergoing rigid motion are observed ⇒ neighbor-
ing points in the image will (most likely) lie on the same object (surface) and, therefore, have
similar velocities.
⇒ Optical flow is required to be smooth
This smoothness constraint can be expressed by minimizing (the square of) the magnitude of the
gradient of the flow velocities:
Es² = (∂u/∂x)² + (∂u/∂y)² + (∂v/∂x)² + (∂v/∂y)²  →  min!
The motion constraint equation and the smoothness constraint should hold for all positions in the
image. Due to violation of the assumptions and noise in the image in practice this will, however,
never be exactly the case.
Therefore, a combined constraint can be formulated as minimizing the following total error:
E = ∫∫ ( Em² + α² Es² ) dx dy
Using variational calculus one can show that a necessary condition for an extremum (minimum) is given by:

fx² u + fx fy v = α² ∇²u − fx ft
fx fy u + fy² v = α² ∇²v − fy ft
Note: The Laplacian ∇²u (or ∇²v) can be approximated in the discrete case as:

∇²u = β (ū_{i,j,k} − u_{i,j,k})

with ū_{i,j,k} being a suitable average over some local neighborhood, e.g. the 4-neighborhood:

ū_{i,j,k} = (1/4) (u_{i−1,j,k} + u_{i,j−1,k} + u_{i+1,j,k} + u_{i,j+1,k})

The proportionality factor β then needs to be set to 4.
Further rearranging of terms yields (after some lengthy derivations) the following form, which defines u, v based on the gradients fx, fy, ft and the local averages ū, v̄:

u = ū − fx (fx ū + fy v̄ + ft) / (α² β + fx² + fy²)
v = v̄ − fy (fx ū + fy v̄ + ft) / (α² β + fx² + fy²)
Remarks:
• The structure of the equations can be used to define an iterative procedure for computing estimates of the flow velocities, i.e. by calculating new estimates (u^{n+1}, v^{n+1}) from estimated derivatives and local averages of the previous velocity estimates:

u^{n+1} = ū^n − fx (fx ū^n + fy v̄^n + ft) / (α² β + fx² + fy²)
v^{n+1} = v̄^n − fy (fx ū^n + fy v̄^n + ft) / (α² β + fx² + fy²)
(Procedure can be initialized by assuming a uniformly vanishing flow field, i.e. velocities u =
v = 0.)
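A compact sketch of this iterative scheme (Horn-Schunck-style updates as given above; the simple derivative approximations and the border handling via np.roll are simplifying assumptions of this illustration):

import numpy as np

def optical_flow(f1, f2, alpha=1.0, iterations=100):
    """Iterative estimation of the optical flow (u, v) between two frames,
    following the update equations above (beta = 4, 4-neighborhood average)."""
    f1 = np.asarray(f1, dtype=float)
    f2 = np.asarray(f2, dtype=float)
    # simple approximations of the spatial and temporal derivatives
    fx = (np.roll(f1, -1, axis=1) - np.roll(f1, 1, axis=1)) / 2.0
    fy = (np.roll(f1, -1, axis=0) - np.roll(f1, 1, axis=0)) / 2.0
    ft = f2 - f1
    u = np.zeros_like(f1)          # initialize with a vanishing flow field
    v = np.zeros_like(f1)
    beta = 4.0
    for _ in range(iterations):
        u_bar = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
                 np.roll(u, 1, 1) + np.roll(u, -1, 1)) / 4.0
        v_bar = (np.roll(v, 1, 0) + np.roll(v, -1, 0) +
                 np.roll(v, 1, 1) + np.roll(v, -1, 1)) / 4.0
        common = (fx * u_bar + fy * v_bar + ft) / (alpha**2 * beta + fx**2 + fy**2)
        u = u_bar - fx * common
        v = v_bar - fy * common
    return u, v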
4.4 Depth

Problem: When mapping a 3D scene onto a 2D image (plane), information about depth (i.e. the distance of a scene point from the image plane) is lost!
(Note: The first two methods are active, the third one passive.)
4.4.1 Stereo Vision

Principle: Two (or more) images/views of a scene are captured from different positions/view-points. A scene point is thus mapped onto two corresponding image points PL and PR in the two stereo images (if it is not occluded).
If the geometry of the camera configuration is known, depth information can be recovered from
the position of the corresponding image points.
Problem:
Simplified Stereo Setup
⇒ relevant parameters:
The following relations between camera (3D) and image coordinates (2D) can be obtained:
(xL, yL) = (f / zC) · (xC, yC)   and   (xR, yR) = (f / zC) · (xC + b, yC)
The distance of corresponding image points is called disparity (obtained from the above relations):
d = xR − xL = f · b / zC
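Given the simplified setup, depth recovery from the disparity is a one-liner; a small illustrative sketch:

def depth_from_disparity(x_left, x_right, f, b):
    """Recover depth z_C from the disparity d = x_R - x_L = f*b / z_C of a
    pair of corresponding image points (simplified stereo setup)."""
    d = x_right - x_left
    if d == 0:
        raise ValueError("zero disparity: point at infinity")
    return f * b / d

# Example: focal length 0.05 m, baseline 0.1 m, disparity 0.001 m -> z_C = 5 m
print(depth_from_disparity(0.010, 0.011, f=0.05, b=0.1))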
4.4.2 Estimation of Stereo-Correspondence via Optical Flow
Let us consider a trivial image sequence consisting of the left and the right image by defining:
Using the simplified stereo setup (as of above) one obtains the following constraint equation:
Ec = fx u + ft = 0
(with ft being the partial derivative of f() in the “direction of the stereo setup” and u corresponding to an estimate of the disparity d).
Solution: Use so-called “directed” smoothness, which considers changes in the displacement field
perpendicular to the grey-level gradient.
E_ds² = (fy u_x − fx u_y)²  →  min!
All constraints can be integrated when minimizing the following total error:

E = ∫∫ ( Ec² + α² (Es² + β² E_ds²) ) dx dy
Chapter 5
Image Primitives
The basis for recognizing objects in images / describing scenes is usually formed by more elementary
“components” – so-called image primitives.
Goal: Partitioning of images into meaningful primitives that form the basis for the extraction of
objects of interest (for a specific application) within (simple or complex) scenes.
5.1 Fundamentals
The most widely used (i.e. “established”) image primitives result from image segmentation, where
the goal is to segment a given image into a set of regions or contours.
• Regions ≙ homogeneous image areas (roughly corresponding to objects or object surfaces)
Note: The process of segmenting an image into regions and contours is approximately dual, i.e. region boundaries are approximately equivalent to contours and contours enclose regions.
A rather “modern” type of image primitives are so-called keypoints (or interest points). A set of key-
points defines a sparse representation (wrt. segmentation) of image content by extracting “interesting”
points/positions within an image, usually augmented by a local feature representation.
Assumptions:
• A region is a homogeneous image area ⇒ A predicate P can be defined that is true for one specific region and false if the region is extended by neighboring pixels (≙ homogeneity criterion)

1. ⋃_i Ri = I, i.e. the set of regions Ri segments (or describes) the whole image I
3. Ri is topologically connected (i.e. for every pair of pixels in Ri there exists a connecting path
of neighboring pixels within Ri )
Note: A homogeneity criterion can be defined on any feature of a pixel i.e. grey level, color, depth,
velocity, ...
5.2.1 Region Growing
Basic Method: Starting from “suitable” initial pixels, regions are iteratively enlarged by adding neighboring pixels.
Problems:
5.2.2 Splitting
Idea: Starting from a single initial region (i.e. the whole image) regions are subsequently subdivided
according to some scheme until all regions satisfy the homogeneity criterion P .
Principle Method:
Possible Solution:
5.2.3 Split-and-Merge
Motivation: Exploit advantages of region splitting and growing (respectively merging) methods by
combining both techniques
Basic Idea: Starting from an initial segmentation suitable splitting and merging operations are ap-
plied until a final segmentation is reached which satisfies all criteria.
Method (after Pavlidis): Split and merge operations operate on a quad-tree representation of the
image
Slide: Quad-tree representation of image segmentation (Fig. 53)
2. Perform all possible region merging operations (within the quad-tree structure!)
(i.e. eventually, 4 nodes in the quad-tree will be replaced by their parent node)
Slide: Result of possible merge operations on the sample image (Fig. 56)
3. Perform all possible regions splits, for regions (i.e. quad-tree nodes) that don’t satisfy the
homogeneity criterion P (Ri )
Slide: Result of possible splitting operations on the sample image (Fig. 57)
4. Merge adjacent regions that - when merged - satisfy the homogeneity criterion P (Ri ∪ Rj ).
Note: In this step the algorithm goes beyond the quad-tree structure of the image.
Slide: Result of merging operations outside the quad-tree structure for the sample image (Fig. 58)
5. Eliminate very small regions by merging them with the neighboring region with most similar
grey value.
Slide: Final segmentation result for the sample image (Fig. 59)
5.3 Contour Extraction

Baseline: Edge detection (i.e. contour extraction starts from edge images)
Note: Methods for edge detection generate, after the application of a suitable thresholding operation, a set of edge pixels where significant changes in grey level occur (per pixel, a measure of the strength and possibly the direction of the edge).
Problems:
Goals: • Extraction of significant edge elements (ideally exactly one edge pixel in
the direction of the contour),
Note: Linking & approximation can also be achieved in an integrated manner via the Hough transform (see below).
Thinning
1. Non-maximum suppression
Basic Method: Eliminate edge elements if the edge intensity is smaller than that of another edge
element in the direction of the gradient (i.e. perpendicular to the edge/potential contour, two
neighbors must be considered)
2. Hysteresis thresholding
Idea: Thresholding with a single threshold usually produces bad results ⇒ use two thresholds!
Basic Method:
Linking
Idea: Start with a randomly chosen edge element and extend the sequence of linked edge elements by one neighbor at a time. Repeat until all edge elements are covered by some linked set.
Problems:
• Contours could be broken up into two (partial) chains of edge pixels (if started “in the middle”).
⇒ extend edge pixel chain in both directions
• At junctions multiple continuations for building edge pixel chains are possible.
⇒ Continue contour in direction with most similar gradient.
Approximation
Goal: Approximation of linked sets of edge elements by a parametric function (e.g. line, circle, ...)
5.3.2 Hough-Transform

... does not try to approximate given edge pixels by a parametric representation but investigates the parameter space of approximations of all edge elements.
Baseline: A straight line can be represented with parameters α and r as follows:

r = x cos α + y sin α

For a given edge pixel (xi, yi) the expression

r = xi cos α + yi sin α

can be considered as a function of α; it then represents a sinusoidal curve in the α-r-plane.
Method:
– For all edge pixels (xi, yi) “draw” the associated curve (or function) in the α-r-plane.
– The intersection points of all curves drawn represent parameters (αk, rk) of lines which approximate the associated edge pixels.
Simple example: Drawing the sinusoidal curves for three edge pixels (x1, y1), (x2, y2), (x3, y3) in the α-r-plane (α ∈ [0, π], r ∈ [−√2·M, √2·M] for an M × M image) shows them intersecting in a common point (αs, rs).

⇒ The line rs = x cos αs + y sin αs passes through the points (x1, y1), (x2, y2), and (x3, y3).
Problem: Determining intersection point mathematically exactly in practice too difficult or
unusable (slight deviations due to noise, quantization, and errors in the edge detection
process).
Solution: Just like the image plane, the α-r-plane needs to be represented digitally, too. Therefore, the ranges of α and r are quantized, resulting in an α-r-matrix. The cells of this matrix can be thought of as accumulators that are incremented whenever a sinusoidal curve for some edge pixel passes through the accumulator's parameter range.
Reconstruction of Contours: Salient contours in the image are found by selecting α-r pa-
rameter pairs with high accumulator counts and checking the associated edge pixels for
continuity.
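A minimal sketch of the accumulator-based line detection (the quantization resolutions n_alpha and n_r are illustrative parameters):

import numpy as np

def hough_lines(edge_image, n_alpha=180, n_r=200):
    """Accumulate votes in a quantized alpha-r matrix for every edge pixel;
    cells with high counts correspond to salient lines r = x*cos(a) + y*sin(a)."""
    edges = np.asarray(edge_image)
    ys, xs = np.nonzero(edges)
    r_max = np.hypot(*edges.shape)                    # sqrt(2)*M for M x M images
    alphas = np.linspace(0.0, np.pi, n_alpha, endpoint=False)
    acc = np.zeros((n_alpha, n_r), dtype=int)
    for x, y in zip(xs, ys):
        r = x * np.cos(alphas) + y * np.sin(alphas)   # sinusoid for this pixel
        r_idx = np.round((r + r_max) / (2 * r_max) * (n_r - 1)).astype(int)
        acc[np.arange(n_alpha), r_idx] += 1
    return acc, alphas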
Slide: Example of the Hough transform on sample infrared image (Fig. 62)
b) Parametric Case
The Hough transform can be applied in all cases where a parametric representation of the curve
used for approximating contours is possible in the form:
g(x, y, c) = 0

e.g. for circles: (x − c1)² + (y − c2)² = c3²
5.4 Keypoints
• a method for describing the local image properties at the keypoint location as uniquely as pos-
sible for later retrieval.
Goal: Find keypoint locations and appropriate descriptions that are (approximately) invariant with
respect to scale, rotation and — to some extent — view-point changes.
Solutions:
Scale-Space Representation of Images
Different scale representations L(x, y, σ) of an image I(x, y) are obtained via a convolution with a
Gaussian G(x, y, σ) with variable scale (i.e. standard deviation) σ, where

L(x, y, σ) = G(x, y, σ) ∗ I(x, y)

and
G(x, y, σ) = 1/(2πσ²) · e^{−(x² + y²)/(2σ²)}
The scale space of an image I(x, y) is then defined by the sequence of Gaussian-smoothed versions L(x, y, kⁿσ) for some σ, a constant multiplicative factor k, and n = 0, 1, 2, ....
Note: With every doubling of the scale kⁿσ (i.e. every so-called octave) the scale-space image L(x, y, kⁿσ) can be subsampled by a factor of 2 without losing information.
Keypoints are detected as the local extrema in the Difference-of-Gaussian (DoG) representation of scale space, i.e. in

D(x, y, σ) = L(x, y, kσ) − L(x, y, σ)
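A small sketch of building the Gaussian scale space and its DoG images (it assumes SciPy's gaussian_filter; octave subsampling and the subsequent extrema search described next are omitted for brevity):

import numpy as np
from scipy.ndimage import gaussian_filter

def dog_scale_space(image, sigma=1.6, k=2 ** 0.5, levels=5):
    """Build Gaussian-smoothed images L(x, y, k^n * sigma) and their
    differences D = L(., k*s) - L(., s) used for keypoint detection."""
    img = np.asarray(image, dtype=float)
    L = [gaussian_filter(img, sigma * k ** n) for n in range(levels)]
    D = [L[n + 1] - L[n] for n in range(levels - 1)]
    return L, D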
Keypoint candidates are defined as the maxima or minima in the DoG images by comparing a pixel
position to its neighbors in 3 × 3 regions in the local and both adjacent scales.
Slide: Determining extrema in DoG image representations (Fig. 64)
Keypoint locations are determined with sub-pixel accuracy by fitting a 3D quadratic function to the
local sample points and determining the location of the interpolated extremum.
Furthermore, keypoint candidates are discarded that:
• lie on edges.
Keypoint Orientation
After interpolation a keypoint can be associated with the Gaussian-smoothed image L(x, y, σ) at the
scale σ closest to the keypoint.
The local orientation at the keypoint (x0 , y0 ) is then determined from a histogram of gradient orien-
tations (resolution of 10 degrees, i.e. 36 bins, weighted by gradient magnitude), which is computed
in the region around the keypoint over the gradients of L(x, y, σ) (weighted by a circular Gaussian
window with σ ′ = 1.5σ ).
⇒ Local maxima (peaks) in the histogram correspond to dominant local orientations.
Keypoints are created for all local maxima in the gradient histogram which are within 80% of the
global maximum (i.e. one candidate location might be used to create multiple keypoints).
After keypoint creation peak positions are interpolated by a parabola over three adjacent bins in the
orientation histogram.
Keypoint Descriptor
• (Image) Gradient magnitudes & orientation are sampled around the keypoint (at scale of key-
point, weighted by Gaussian)
• Orientation histograms are created over m × m sample regions (e.g. 4 × 4 regions with 8 × 8
sample array)
Note: Keypoint descriptors can be matched (nearest neighbor) across images in order to identify – to some extent – identical/similar locations at different scales/rotations/view points.
Chapter 6
In contrast to object recognition based on image primitives (regions, contours) and some method for
finding appropriately structured configurations of those that correspond to objects, in appearance-
based approaches objects are represented – more or less – by using image data directly (e.g. different
views of an object).
Known instances of objects can then be identified by matching the image-based representations to
new data. Using an appropriate similarity criterion also objects from an object category can be found.
• Build “model” of an object by extracting a (rectangular) image area (the template f ) showing
the desired object.
Note: Also multiple templates (≙ views) per object can be used.
• For finding (this object, a similar one) in a new image g compute “similarity” between the
template and the new image at every position in the image.
Common similarity measure: cross-correlation
h(x, y) = f(x, y) ◦ g(x, y) = (1 / MN) Σ_{m=0}^{M−1} Σ_{n=0}^{N−1} f(m, n) g(x + m, y + n)
Note: As the cross-correlation measure not only depends on the similarity between template and image but also on the local brightness of the image, an appropriate normalization is necessary in practice (see the tutorials).
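A sketch of brightness-normalized template matching (a correlation coefficient per window position; a brute-force loop for clarity, not an efficient implementation):

import numpy as np

def normalized_cross_correlation(template, image):
    """Slide the template over the image and compute, at every position, the
    correlation coefficient of the mean-corrected template and image window;
    values close to 1 indicate good matches independent of local brightness."""
    f = np.asarray(template, dtype=float)
    g = np.asarray(image, dtype=float)
    f0 = f - f.mean()
    M, N = f.shape
    out_shape = (g.shape[0] - M + 1, g.shape[1] - N + 1)
    h = np.zeros(out_shape)
    for x in range(out_shape[0]):
        for y in range(out_shape[1]):
            win = g[x:x + M, y:y + N]
            w0 = win - win.mean()
            denom = np.sqrt(np.sum(f0 ** 2) * np.sum(w0 ** 2))
            h[x, y] = np.sum(f0 * w0) / denom if denom > 0 else 0.0
    return h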
6.2 Matching Configurations of Keypoints cf. [Low04]
Note: Matching of single keypoints not reliable enough for object detection / recognition
Basic Idea: Match sets of keypoints (lying on desired object) and verify correct spatial configuration
of matches
1. Model of object:
(a) Set of keypoints (specified by their descriptors [including local orientation and scale!])
(b) Reference point on object
(c) For each keypoint: Vector to the reference point relative to the keypoint's orientation and scale
2. Matching process:
(a) For every matching keypoint determine reference point candidate (exploit local scale
& orientation of matched keypoint)
(b) Vote for all reference point candidates associated with keypoint matches
(c) Reference point(s) with highest number(s) of votes determines location of object hy-
pothesis/es
6.3 Eigenimages

Idea: Relevant information for representing a class/set of known objects should be automatically derived from sample data.
Basic Abstraction: Sample images are considered as points in a high-dimensional vector space (of
images).
Goal: Find suitable representation of the “point cloud” in the image space that represents the known
objects.
Derive appropriate similarity measure for finding known object instances or objects from the
modeled category.
Most well known example: Eigenfaces, i.e. application to the problem of face detection/identification
• Representation of sub-space possible via center of gravity (i.e. mean vector) of samples and
sample covariance matrix
⇒ Principal components of the covariance matrix (i.e. its eigenvectors) span the sub-space
• Every sample (i.e. vector in the sub-space) can be reconstructed via a linear combination of the
eigenvectors
Principle Method
Building the Model:
1) Collect a set of sample images from the class of objects to be modeled (e.g. face images)
Slide: Example of sample set of face images (Fig. 68)
2a) Compute the principal components (Eigenimages) of the sample set and
Slide: Example of Eigenfaces obtained (Fig. 69)
2b) Select the K eigenvectors corresponding to the largest eigenvalues for representing the data
3) For all known instances (individuals) compute the projection onto the modeled sub-space (which
is spanned by the selected eigenvectors).
⇒ one K-dimensional vector of weights per instance
Note: Modeling quality can be assessed by inspecting reconstructions of the known data.
4) For an unknown image (e.g. a new face image) compute its distance to the modeled sub-space
(e.g. face-space).
⇒ Reject (i.e. none of the known objects) if too large.
Note: When searching a larger image for smaller realizations of the known objects many pos-
sible sub-images at different scales need to be considered!
Slide: Example of face/non-face classification using image reconstruction via Eigenfaces (Fig. 70)
5b) Classify the resulting vector as known or unknown instance given the projections of the (known)
sample data (e.g. by mapping to the nearest neighbor in the projection space).
The deviation of a face image Γn from the average face Ψ is, therefore: Φn = Γn − Ψ (i.e. the set Φ = {Φ1, Φ2, ... ΦN} corresponds to face images normalized to have zero mean).
• For the set of difference images Φ perform a Principal Component Analysis (PCA)
⇒ a set of K < N (maximum value of K = N − 1 if all Φn are linearly independent) orthonormal vectors u1, u2, ... uK
i.e.

u_kᵀ u_l = 1 if k = l, and 0 otherwise

(where {·}ᵀ denotes the vector/matrix transpose) and

λ_k = (1/N) Σ_{n=1}^{N} (u_kᵀ Φ_n)²     (u_kᵀ Φ_n ≙ projection of Φ_n onto u_k)

is maximum.
• The vectors u_k and scalars λ_k are eigenvectors and eigenvalues, respectively, of the covariance matrix C of the normalized images:

C = (1/N) Σ_{n=1}^{N} Φ_n Φ_nᵀ = A Aᵀ   with   A = [Φ_1, Φ_2, ... Φ_N]
Problem: The covariance matrix C has dimension M² × M² (i.e. for e.g. 512 × 512 images: 512² × 512² = 262144 × 262144)
⇒ computation of eigenvalues/vectors extremely problematic
Solution: As there are only N images in the training set that is used for computing C, a maximum of K = N − 1 eigenvectors of C exist (N − 1 << M²).

Compute the eigenvectors of AᵀA (dimension: N × N):

AᵀA v_i = μ_i v_i        | multiply by A from the left
A Aᵀ (A v_i) = A μ_i v_i = μ_i (A v_i)

⇒ the A v_i are eigenvectors of C = A Aᵀ (with the same eigenvalues μ_i).
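The complete construction can be sketched in a few lines of NumPy (function names are illustrative; the images are assumed to be equally sized arrays):

import numpy as np

def compute_eigenfaces(images, K):
    """Compute K eigenfaces from a list of equally sized face images using
    the A^T A trick (an N x N eigenproblem instead of M^2 x M^2)."""
    X = np.stack([np.asarray(im, dtype=float).ravel() for im in images])  # N x M^2
    psi = X.mean(axis=0)                       # average face
    A = (X - psi).T                            # columns Phi_n = Gamma_n - Psi
    mu, V = np.linalg.eigh(A.T @ A)            # small N x N eigenproblem
    order = np.argsort(mu)[::-1][:K]           # largest eigenvalues first
    U = A @ V[:, order]                        # A v_i are eigenvectors of A A^T
    U /= np.linalg.norm(U, axis=0)             # normalize the eigenfaces
    return psi, U

def project(face, psi, U):
    """Project a face image onto face space: omega_k = u_k^T (Gamma - Psi)."""
    return U.T @ (np.asarray(face, dtype=float).ravel() - psi)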
• A known face image Γ can now be represented by its projection onto face space:
ω_k = u_kᵀ (Γ − Ψ),   k = 1 ... K
• Any image can be reconstructed (in general only approximately) from its projection onto face
space
With Φ = Γ − Ψ the reconstruction of Φ in face space is:
K
X
Φ̂ = ωk uk
k=1
• Non-face images can be rejected based on their distance to face space (i.e. the quality of the
reconstruction):
d_face = ‖Φ − Φ̂‖²
• Face images of known (or possibly unknown) individuals can be identified based on the simi-
larity of their projections to those of known individuals from the training set.
Simple method: choose the individual whose projection Ωj has minimum Euclidean distance to the projection Ω of the query image:

j = argmin_k ‖Ω − Ω_k‖²
Slide: Processing steps necessary for implementation of the Eigenface approach (Fig. 71)
Chapter 7
Tracking
after [For03, Chap. 17]
7.1 Introduction
When objects move in a scene in general a sequence of images is required in order to draw inferences
about the motion of the object. The situation is similar when the camera (i.e. the observer) moves
through a scene. Then the motion of the observer can be inferred. This problem is known as tracking.
• Targeting (in the military domain): try to predict an object’s (target’s) future position in the
attempt to shoot it.
Note: Usually radar or infrared images are used.
• Surveillance: motion patterns of e.g. people on a parking lot are used to draw inferences about
their goals (e.g. trying to steal a car)
• Automotive: traffic assistance systems (e.g. for lane-keeping, adaptive cruise control) infer
motion of lane marks or other vehicles
• Motion capture: special effects in movies sometimes rely on the possibility to track a moving
person accurately and to later map the motion onto another - usually artificially generated -
actor
In all these cases, drawing inferences requires:
• some model of the object's dynamics (i.e. of how its internal state evolves over time), and
• some set of measurements from the image sequence (e.g. the object's position estimate).
We will consider drawing inferences in the linear case only, i.e. both the motion model and the measurement model are linear.
7.2 Tracking as an Inference Problem
• Prediction: from past measurements y0, y1, ..., yt−1 predict the internal state at the t-th frame, i.e. compute P(Xt | y0, ..., yt−1).
• Data association: from multiple measurements (e.g. position estimates) at frame t “select” the
“correct” one. Can be achieved based on the prediction of Xt .
Possible methods (a small sketch of gating and probabilistic data association follows after this list):
– Selecting the nearest neighbor, i.e. from several measurements yt^k choose the one maximising P(yt^k | y0, ..., yt−1).
– Perform gating (i.e. exclude measurements that are too different from the prediction) and probabilistic data association (i.e. a weighted sum [according to the prediction probability] of the gated measurements):

    yt = Σ_k P(yt^k | y0, ..., yt−1) yt^k
• Correction: correct the internal state estimate by incorporating the new measurement, i.e. compute P(Xt | y0, ..., yt).
• Measurements depend only on the current state, i.e. not on other measurements taken: P(Yt | Xt, Y0, ..., Yt−1) = P(Yt | Xt).
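The following is a small sketch of gating plus probabilistic data association under a Gaussian prediction (the function name, the gate value of 9.0 and the use of unnormalized Gaussian likelihoods as association weights are illustrative choices, not prescribed by the notes):

```python
import numpy as np

def gate_and_associate(candidates, y_pred, S, gate=9.0):
    """Gating + probabilistic data association for one tracked object.

    candidates: (n, d) array of candidate measurements y_t^k at frame t.
    y_pred:     (d,) predicted measurement (e.g. M x_t^- in the Kalman case below).
    S:          (d, d) covariance of the predicted measurement.
    gate:       assumed squared Mahalanobis-distance threshold.
    Returns the associated measurement y_t, or None if all candidates are gated out.
    """
    diffs = candidates - y_pred                          # (n, d)
    S_inv = np.linalg.inv(S)
    d2 = np.einsum('nd,de,ne->n', diffs, S_inv, diffs)   # squared Mahalanobis distances
    keep = d2 < gate                                     # gating step
    if not np.any(keep):
        return None
    w = np.exp(-0.5 * d2[keep])                          # Gaussian likelihoods as weights
    w /= w.sum()
    return w @ candidates[keep]                          # weighted sum of gated measurements
```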
7.3 Kalman-Filter
• All probability distributions are Gaussians, i.e. can be represented by their mean and associated
covariance matrix.
• The uncertainty of the prediction (i.e. of the linear dynamic model, represented by a matrix D, applied to the previous state) is described by the covariance matrix Σd (which could be time dependent).
• A measurement matrix M (could be dependent on time but usually is not) is used to convert
between internal state and measurements taken:
ŷt = M xt
• The uncertainty about the measurement process is represented by the covariance matrix Σm
(which could also be time dependent).
Note: The state vector Xt is normally distributed with mean x̂t and covariance matrix Σd . The
measurement vector Yt is normally distributed with mean ŷt and covariance Σm .
• (Quasi-)Stationary point: Internal state and measurements represent identical quantities, i.e. M = I. Motion occurs only as a random component, i.e. via the uncertainty Σd of the dynamic model (when this is assumed to be quite large, the model can be used for tracking even if nothing is known about the object's dynamics).
• Constant velocity: the state vector contains both position and velocity; the position evolves as

    pt = pt−1 + ∆t · v
The dynamic model is then given by

    D = ( I   ∆t · I )
        ( 0      I   )
• Constant acceleration: analogous to the above, with an additional acceleration parameter a as a component of the state vector (a small sketch constructing D for these models follows below).
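A small sketch constructing D for a constant-velocity model and, analogously, for a constant-acceleration model (the dimension, time step and the particular second-order blocks of the acceleration model are assumed choices, not fixed by the notes):

```python
import numpy as np

def constant_velocity_D(dim=2, dt=1.0):
    """State transition matrix for a constant-velocity model.

    State vector: [position; velocity], each of dimension `dim`,
    so that p_t = p_{t-1} + dt * v_{t-1} and v_t = v_{t-1}.
    """
    I = np.eye(dim)
    Z = np.zeros((dim, dim))
    return np.block([[I, dt * I],
                     [Z, I]])

def constant_acceleration_D(dim=2, dt=1.0):
    """Analogous model with an additional acceleration component in the state."""
    I = np.eye(dim)
    Z = np.zeros((dim, dim))
    return np.block([[I, dt * I, 0.5 * dt**2 * I],
                     [Z, I,      dt * I],
                     [Z, Z,      I]])
```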
Kalman-Filtering Algorithm
Goal: Estimate Gaussian probability distributions describing the linear dynamic model optimally in
the sense of least mean squared error.
Processing Steps:
Distinguish between state estimates before (x̂_t^−) and after (x̂_t^+) the incorporation of a new measurement yt.

0. Assume some initial estimates of x̂_0^− and the covariance Σ_0^− are known.
1. Predict the new internal state xt from the past state by applying the dynamic model of motion:

    x̂_t^− = D x̂_{t−1}^+
    Σ_t^− = D Σ_{t−1}^+ D^T + Σd
2. Correct the prediction using the new measurement yt: first compute the Kalman gain

    K_t = Σ_t^− M^T (M Σ_t^− M^T + Σm)^{−1}

which represents the ratio between the uncertainty of the model (Σ_t^−) and the uncertainty of the measurement process (M Σ_t^− M^T + Σm).
The innovation yt − M x̂_t^− (i.e. the deviation of the actual from the predicted measurement) is then weighted by the gain to obtain the corrected state estimate and covariance:

    x̂_t^+ = x̂_t^− + K_t (yt − M x̂_t^−)
    Σ_t^+ = (I − K_t M) Σ_t^−
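A minimal sketch of one prediction/correction cycle implementing the update equations above (function and variable names are illustrative):

```python
import numpy as np

def kalman_predict(x_prev, Sigma_prev, D, Sigma_d):
    """Prediction: x_t^- = D x_{t-1}^+,  Sigma_t^- = D Sigma_{t-1}^+ D^T + Sigma_d."""
    x_pred = D @ x_prev
    Sigma_pred = D @ Sigma_prev @ D.T + Sigma_d
    return x_pred, Sigma_pred

def kalman_correct(x_pred, Sigma_pred, y, M, Sigma_m):
    """Correction: weight the innovation y - M x_t^- by the Kalman gain K_t."""
    S = M @ Sigma_pred @ M.T + Sigma_m                   # uncertainty of the measurement process
    K = Sigma_pred @ M.T @ np.linalg.inv(S)              # Kalman gain K_t
    x_post = x_pred + K @ (y - M @ x_pred)               # corrected state estimate
    Sigma_post = (np.eye(len(x_pred)) - K @ M) @ Sigma_pred
    return x_post, Sigma_post
```

Tracking a point with the constant-velocity model would then, e.g., alternate kalman_predict (with D from the sketch after the motion models) and kalman_correct, where M selects the position components of the state.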
Bibliography
[Hor81] B. K. P. Horn, B. G. Schunck: Determining Optical Flow, Artificial Intelligence, Vol. 17, 1981, pp. 185–203.
[Low04] D. Lowe: Distinctive Image Features from Scale-Invariant Keypoints, Int. J. of Computer Vision, Vol. 60, No. 2, 2004, pp. 91–110.
[Nie90] H. Niemann: Pattern Analysis and Understanding, Vol. 4 of Series in Information Sciences, Springer, Berlin Heidelberg, 2nd ed., 1990.
[Ram72] U. Ramer: An Iterative Procedure for the Polygonal Approximation of Plane Curves, Computer Graphics and Image Processing, Vol. 1, No. 3, 1972, pp. 244–256.
[Tur91] M. Turk, A. Pentland: Eigenfaces for Recognition, Journal of Cognitive Neuroscience, Vol. 3, No. 1, 1991, pp. 71–86.
Figure 9: “Stanley” (left) and “Highlander” (right) with sensors (Source: DARPA)
Figure 10: Surveillance of persons arriving at the parking lot before IBM's T. J. Watson research center
Figure 12: Example of face detection results for strange people (by Yann LeCun)
Figure 13: What do all these objects – except one – have in common? [Jäh02, p. 17]
Figure 14: Gestalt laws (after [Sch01])
Figure 18: Overview of the electromagnetic spectrum; range of visible light enlarged
Figure 19: Paraxial refraction: A light ray through P1 is refracted at P (where it intersects the interface, i.e. the surface of the lens) and then intersects the optical axis at P2. The geometric center of the interface is C, its radius R; all angles are assumed small (after [For03, Fig. 1.8, p. 9]).
Figure 20: A thin lens: Rays through O are not refracted, rays parallel to the optical axis are focused in F′. Also note the different in-focus image points for object points at different distances (cf. [For03, Fig. 1.9, p. 10]).
Figure 21: The human eye as an imaging system (from [Gon02, Chap. 2])
Figure 22: Schematic structure of the human eye (from [Gon02, Chap. 2])
Figure 23: Distribution of rods and cones on the retina (from [Gon02, Chap. 2])
Figure 24: Response of cones to different wavelengths (from [Bal82, Chap. 2])
Figure 25: Chromaticity diagram (cf. [Gon02, Chap. 6])
Figure 27: Additive color mixing in the RGB model vs. subtractive color mixing in the CMY model (from [Gon02])
Figure 28: Coordinate convention used with digital images (after [Gon02, Chap. 2]; note: there the coordinate axes are swapped!)
Figure 29: Different definitions of pixel neighborhoods: the 4-neighborhood of (x, y) consists of (x−1, y), (x+1, y), (x, y−1) and (x, y+1), while the 8-neighborhood additionally contains the four diagonal neighbors (left). Which "objects" are connected? (right) (after [Jäh02, p. 42])
Figure 32: Examples of histogram equalization on images from Fig. 31 with corresponding final
intensity histograms (from [Gon02, Fig. 3.17, Chap. 3])
Figure 33: Examples of position/orientation normalization based on normalization of image (grey
level) moments (from [Nie03, Fig. 2.5.7, Chap. 2])
Figure 34: Example normalisation of character slant by applying a shear-transform (from [Nie03, Fig. 2.5.8, Chap. 2])
Figure 35: Principle of filtering using masks (from [Gon02, Chap. 3]) Note: Coordinate axes are
swapped w.r.t. usual upper-left convention!
Figure 36: Examples of 1D functions f (x) and the corresponding power spectra |F (u)|, i.e. the magnitude of the 1D-DFT (from [Gon02, Chap. 4])
Figure 37: Example of a 512 × 512 image containing a 20 × 40 white rectangle and the associated centered logarithmic power spectrum log(1 + |F(u, v)|) (from [Gon02, Chap. 4])
Figure 40: Example for smoothing by averaging vs. smoothing via the median
Figure 41: Example for erosion (= minimum), dilation (= maximum) and edge filtering ("morphological edge") with a 5×5 mask
Figure 42: Example for opening (erosion + dilation) and closing (dilation + erosion) with a 5×5 mask
Figure 44: Examples of behaviour of derivatives at an ideal ramp edge (from [Gon02, Chap. 10])
Figure 46: Example of gradient in x- and y-direction and combined magnitude (from [Gon02, Chap. 10])
Figure 47: 5 × 5 Laplacian of a Gaussian mask (LoG) (from [Gon02, Chap. 10])
Figure 48: Example of Sobel operator, Laplacian and Laplacian smoothed with a Gaussian (LoG)
Figure 49: Example of some natural textures (from [Jäh02])
Figure 51: Example of optical flow computation: (a) first and (b) second image of sequence, (c) estimated flow field, (d) detail of (c), (e) color coded flow field, and (f) color code map for vector representation (From: http://www.cs.ucf.edu/˜jxiao/opticalflow.htm)
Figure 52: Stereo imaging geometry: left and right camera coordinate systems (XL, YL, ZL) and (XR, YR, ZR) with focal length f and image points PL, PR, a common coordinate system (XC, YC, ZC), and a scene point P(xC, yC, zC)
Figure 53: Quad-tree representation of image segmentation (after [Nie90, p. 108], cf. also [Gon02, p. 616])
Figure 54: Split-and-Merge Algorithm: Simple sample image in numeric and grey-level representation
Figure 55: Split-and-Merge Algorithm: Initial segmentation of sample image obtained by selecting 3rd level of quad-tree representation
Figure 56: Split-and-Merge Algorithm: Result of possible merge operations on the sample image
Figure 57: Split-and-Merge Algorithm: Result of possible splitting operations on the sample image
Figure 58: Split-and-Merge Algorithm: Result of merging operations outside the quad-tree structure for the sample image
Figure 59: Split-and-Merge Algorithm: Final segmentation result for the sample image
Figure 60: Example of results obtained with the Canny Edge Detector
Figure 61: Illustration of the Hough transform (from [Gon02, Chap. 10, Fig. 10.20])
Figure 62: Example of the Hough transform on sample infrared image (from [Gon02, Chap. 10, Fig. 10.21])
Figure 63: Scale space representation scheme for images: Gaussian smoothed image at different scales and organized into octaves (left) and
Difference of Gaussian representation (right) (from [Low04])
Figure 65: Example of keypoint detection on a natural image: (a) original image (b) initial 832 keypoint locations at maxima and minima in the DoG
representation (from [Low04])
Figure 66: Example of keypoint detection on a natural image: (a) original image (b) initial 832 keypoint locations at maxima and minima in the DoG
representation (from [Low04])
Figure 67: Correspondences between matched keypoints within two images of a well known building
taken from different view points
Figure 71: Processing steps necessary for implementation of the Eigenface approach (after [Tur91]); block labels: input image Γ, mean image Ψ, eigenvectors ui, distance measure Φ − Φ̂