Unit 4: Computer Vision Lecture Notes


Contents

1 Introduction
  1.1 What is Vision? ... Computer Vision?
  1.2 Types of Images
  1.3 Application Areas
  1.4 The Image Processing / Machine Vision Universe
  1.5 Why is Computer Vision Difficult?
  1.6 Typical Architecture of a Machine Vision System

2 Fundamentals of Machine Vision
  2.1 Image Acquisition
    2.1.1 Visualization
    2.1.2 Image Formation: The Role of Cameras
  2.2 Structure of the Human Eye
  2.3 Color
    2.3.1 Human Color Perception
    2.3.2 Color Models
  2.4 Digitization
    2.4.1 Sampling
    2.4.2 Quantization
    2.4.3 Representation of Digital Images
  2.5 Imaging Devices

3 Preprocessing
  3.1 Normalization
    3.1.1 Intensity
    3.1.2 Histogram Equalization
    3.1.3 Some Other Techniques
  3.2 Filtering
    3.2.1 Linear Filters and Convolution
    3.2.2 Filtering in the Frequency Domain: Overview
    3.2.3 Rank-Order Filters

4 Local Image Features
  4.1 Edges
    4.1.1 Detection of Local Discontinuities
    4.1.2 Edge Operators
  4.2 Texture
    4.2.1 Statistical Texture Representation: Co-occurrence Matrices
    4.2.2 Spectral Texture Representation: Overview
  4.3 Motion
    4.3.1 Optical Flow
    4.3.2 Estimation of Optical Flow
  4.4 Depth
    4.4.1 Stereo Vision
    4.4.2 Estimation of Stereo-Correspondence via Optical Flow

5 Image Primitives
  5.1 Fundamentals
  5.2 Region Segmentation
    5.2.1 Region Growing
    5.2.2 Splitting
    5.2.3 Split-and-Merge
  5.3 Contour Extraction
    5.3.1 Contour Following
    5.3.2 Hough-Transform
  5.4 Keypoints
    5.4.1 The Scale-Invariant Feature Transform (SIFT)

6 Appearance-Based Object Recognition
  6.1 Template Matching
  6.2 Matching Configurations of Keypoints
  6.3 Eigenimages
    6.3.1 Formal Problem Statement
    6.3.2 Computation of Eigenfaces

7 Tracking
  7.1 Introduction
  7.2 Tracking as an Inference Problem
  7.3 Kalman-Filter
Chapter 1

Introduction

1.1 What is Vision? ... Computer Vision?

Vision: One of the 6 human senses (vision, audition [hearing], haptics [touch] + proprioception [the
kinesthetic sense], olfaction [smell], gustation [taste], equilibrioception [sense of balance])
For humans the primary sense of perception (wrt. information density perceived: 10M bits per
second)
Allows perception of 3D scenes on the basis of 2D mappings (produced by the eyes, “images”)
of the scene
(Perception is the process of acquiring and interpreting sensory information.)

Example: Finding the keys of your car on a cluttered table.


Note/Problem: Though many aspects of the human visual system have been investigated, it is
     not known how exactly visual percepts (≙ results of perception) are represented in the
     brain!

Computer Vision: Realization of visual perception capabilities (known from humans) within artifi-
cial systems (e.g. robots)

Example: Recognizing a soccer ball on the playground among other soccer playing robots
     (e.g. Aibos)
Note: Here, problem is greatly simplified by e.g. giving playground and ball well-defined
colors and using (rather) controlled illumination.

Central components of all artificial vision systems are digital(!) computers.


⇒ Basis: Digital images / sets of images / image sequences

Computer Vision vs Image Processing: Image processing deals with aspects of manipulating and
interpreting (digital) images in general

• Broader methodological basis
(includes: restoration, compression, enhancement, ...; image synthesis = computer graph-
ics)
i.e. covers aspects not relevant from the perspective of (human) perception
• Focus mostly on lower level processing steps
(i.e. the more interpretation/reasoning required the more likely a method will be called
“computer vision” rather than “image processing”)

1.2 Types of Images

• Most widely used image type: “natural” images, i.e. resulting from a scene illuminated with
  visible light (≙ the type captured with standard cameras)
Slides: TU Dortmund (Fig. 1), Taipei traffic (Fig. 2), ...

Note: When understanding “computer vision” as implementing human perceptual capabilities


on artificial systems this is the data we will be dealing with!
For “image processing” in general many more image types are used.

• Infra-red images, i.e. resulting from radiation of hot surfaces (captured e.g. from satellites)
Slide: North America (IR) (Fig. 3)

Note: Images are rendered in so-called pseudo-colors for visual perception.

• Multi-spectral images, i.e. taking into account other parts of the electromagnetic spectrum
(which visible light is part of)
Slide: LandSat image of Amazonas rain-forest region (Fig. 4)

• X-ray images (from medical applications or from astronomy)


Slide: X-ray image of human body parts (Fig. 5)

Note: In medical applications X-ray “illumination” is absorbed by tissue!


Also: Here X-ray imaging is an active method, as illumination is part of image acquisition
     and not “natural”, i.e. passively observed.

Slide: X-ray image of “Lockman Hole” (Fig. 6)

• Ultrasonic images (from medical applications, to some extent: robot navigation)


Slide: Ultrasonic image of a human embryo (Fig. 7)

• Depth images (generated by e.g. laser range finder or so called “time-of-flight” cameras)
Slide: Depth image of in-door scene (Fig. 8)

1.3 Application Areas

Computer Vision

• Autonomous Robot Navigation


Example: Autonomous vehicles, e.g. “Stanley” (built at Stanford University, base: VW Touareg,
won the DARPA Grand Challenge 2005, sensors: 5 laser scanners and 1 color camera), “High-
lander” (built at CMU, base: Hummer, short and long range LIDAR [= Laser Imaging Detection
and Ranging, i.e. produces depth images] and RADAR)
Slide: “Stanley” and “Highlander” with sensors (Fig. 9)

• Surveillance i.e. monitoring activity of persons, vehicles etc. in public spaces

– Completely automatic solutions for reading of license plates


– Surveillance of persons’ activities in specialized situations already possible (activity [e.g.
attempts of theft] in a parking lot, attempts of bank robberies)
Slide: Surveillance of parking lot at IBM (Fig. 10)
– Surveillance of persons (individuals) on the horizon

• Face Recognition (i.e. detection and/or identification [e.g. for biometric access control])
Slide: Examples of face detection (Fig. 11, Fig. 12)

Applications more to be associated with “Image Processing”

• Identification of workpieces in industrial settings (on assembly line)

• Automatic reading of postal addresses (largely for machine-printed text; for handwritten ap-
  prox. 50% “finalization” in the US as of 2000 [cf. IWFHR 7])

• Quality control in manufacturing (by automatic visual inspection)

• Monitoring/Analysis of biological/physical/chemical/... processes (e.g. growth of cell popula-
  tions, finding traces of elementary particles)

• And not to forget: automatically guided weapons (e.g. cruise missiles use an image of the target
  for a correlation search in the final phase of flight)

1.4 The Image Processing / Machine Vision Universe

1.5 Why is Computer Vision Difficult?

• Humans use contextual knowledge and knowledge about the world


Slide: “Lampenrätsel” (lamp puzzle: Which object is not a lamp?) from Jähne [Jäh02, p. 17] (Fig. 13)

• Human visual system is highly specialized, e.g.

– Many pre-attentional effects (e.g. “Gestalt laws”)


Slide: Gestalt laws (Fig. 14)
– “Important” visual cues are processed especially efficiently / reliably (e.g. human faces,
motion)
– Many non-linear effects, e.g. in perception of image intensity
Slide: Election recount (Fig. 15)

• When trying to rule out these effects ...


Slide: Digital face image in “numerical domain” (Fig. 16)

1.6 Typical Architecture of a Machine Vision System

Chapter 2

Fundamentals of Machine Vision

2.1 Image Acquisition

Goal: Mapping/Projection of a 3-dimensional scene onto a 2-dimensional (digital) image (in the
memory of a machine vision system)
Processing Steps:

a) Visualization
by means of physical processes of ...

– Reflection (e.g. “ordinary” photographic images)


– Absorption (e.g. medical X-ray imaging)
– Emission (e.g. infra-red imaging, astronomy [X-ray, radio])

... of radiation by objects / scenes.

b) Image Formation
An imaging system (e.g. a camera, electromagnetic field) projects the radiation originating
from the 3-dimensional scene onto a 2-dimensional image plane

c) Digitization (= Sampling + Quantization)


The continuous image formed by the radiation incident on the image plane is sampled in a
discrete grid, i.e. the incident radiation is measured [what? ... usually intensity, more later] at
discrete points on the image plane (the values measured are stored).
Note: Measurement values are still continuous!
Sampled intensity values are additionally mapped onto a finite set of discrete values – i.e. quan-
tized.
Digitization in the spatial domain (image plane) – i.e. sampling – and in the amplitude domain
– i.e. quantization – produces a digital image.

2.1.1 Visualization

Most “widely used” / “readily available” type of radiation for the visualization of objects / scenes:
Electromagnetic radiation.
Slide: Overview of the electromagnetic spectrum (Fig. 18) (Note: Similar graphic also in [Gon02])
The most important part of the electromagnetic spectrum for human and computer vision is visible light
with wavelengths ranging from approx. 400 to 800 nm.
⇒ Will consider only this type of radiation further!
Remarks:

• The el-mag. spectrum is continuous, i.e. arbitrary wave lengths can occur (ignoring quantum
effects!)

• Real radiation is in general not homogeneous but consists of a mixture of different wave lengths

⇒ therefore the following can be measured:

• Intensity of the radiation (in a given range of wave lengths; for visible light: brightness)

• Composition of the total radiation from parts with different wavelengths (for visible light:
“color” [Beware: Color is not an objective but a subjective measure!])

Intensity

For so-called “grey level/scale images” only the intensity which is reflected from an object / a scene
is measured.
The observed (light) intensity results from:

• the intensity of the illumination sources

• the spatial configuration (distance, orientation) of illumination sources and (reflective) object
surfaces

– intensity is inversely proportional to the squared distance (object to source)


– illumination effect is maximal if the direction of incident light is perpendicular to the surface

• the position of the observer (e.g. camera)

• the reflectivity of object surfaces involved.

Surface reflection is composed of

  – a specular (i.e. mirror-like) part and

  – a diffuse (i.e. reflected in all directions) part.

  Proportions of specular and diffuse reflection vary for different surfaces.

• the absorption of incident light by object surfaces (almost complete for visible light: black
surfaces)

Note: The observed intensity results from the interplay of all factors (it is in general dependent on the
illumination sources and the complete scene)
⇒ Inverse problem in computer graphics: Generating realistic illumination for scenes.

Spectral Composition

• Spectral composition of illumination (here: from el-mag. spectrum, especially light) is primar-
ily dependent on

– the illumination sources and


– the absorption by object surfaces (reduces intensity of specific parts of the spectrum)

• The (varying) spectral composition of visible light (i.e. the physical quantity) produces the
sensation of color (i.e. subjective) in the human visual system

Note: There is no one-to-one mapping between spectral composition of light and perceived color!

More about color later!

2.1.2 Image Formation: The Role of Cameras after [For03, Chap. 1]

In image formation an imaging system (mostly: an optical system) projects an image of a 3-dimensional
scene or object onto a 2-dim. image plane.

A Simple Imaging Device

Experiment: Take a box, prick a small hole into one of its sides (with a pin), replace the side
     opposite to the hole with a translucent plate.
Hold the box in front of you in a dimly lit room, with the pinhole facing a candle (i.e. a
light source) ...
What will you see? — An inverted image of the candle
⇒ Camera Obscura (invented in 16th century)

The Pinhole Camera Model

Idealized optical imaging system / simplest imaging system imaginable (idealization of camera ob-
scura)

Principle: An infinitely small pinhole ensures that exactly one light ray originating from a point in the
     scene falls onto a corresponding point in the image plane, i.e. exactly one light ray passes through
     each point in the image plane, the pinhole, and some scene point.

Note: Despite its simplicity the pinhole model often provides an acceptable approximation of the
imaging process!

• Pinhole camera defines central perspective projection model.

• Perspective projection creates inverted images ⇒ More convenient to consider virtual image
on plane in front of the pinhole.

• Obvious effect of perspective projection: Apparent size of objects is dependent on their dis-
tance.

Geometry of pinhole projection:

after [For03, Fig. 1.4, p. 6]

• Coordinate System (O, i, j, k) attached to the pinhole of the camera (Origin O coincides with
pinhole)

• Image plane is located at positive distance f ′ from pinhole

• The line perpendicular to the image plane and passing through O is called the optical axis, the point C′ the image
  center

Let P denote some scene point with coordinates (x, y, z) and P′ its image at (x′, y′, z′).
As P′ lies in the image plane: z′ = f′.
As P, O, and P′ are collinear: $\vec{OP'} = \lambda\, \vec{OP}$ for some λ.

Consequently, we obtain the following relations between the coordinates of scene and image points:

$$\begin{pmatrix} x' \\ y' \\ f' \end{pmatrix} = \lambda \begin{pmatrix} x \\ y \\ z \end{pmatrix} \quad\Leftrightarrow\quad \lambda = \frac{x'}{x} = \frac{y'}{y} = \frac{f'}{z}$$

and finally:

$$x' = f'\,\frac{x}{z} \qquad\text{and}\qquad y' = f'\,\frac{y}{z}$$

Note: The perspective projection model can be further simplified by assuming that the scene depth is small
with respect to the scene distance ⇒ scene points have approximately identical distance z = z₀ ⇒
constant magnification $m = -\frac{f'}{z_0}$ (weak perspective projection)
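To make the projection equations concrete, here is a minimal Python sketch (not part of the original notes; function and variable names are illustrative) that applies x′ = f′·x/z, y′ = f′·y/z to a few scene points:

```python
import numpy as np

# Hypothetical helper: central perspective projection of scene points given in
# camera coordinates (x, y, z) onto the image plane at distance f' (pinhole model).
def project_pinhole(points_xyz, f_prime):
    points_xyz = np.asarray(points_xyz, dtype=float)
    z = points_xyz[:, 2]
    x_img = f_prime * points_xyz[:, 0] / z
    y_img = f_prime * points_xyz[:, 1] / z
    return np.stack([x_img, y_img], axis=1)

# The same object at twice the distance appears half as large (perspective effect).
print(project_pinhole([[1.0, 0.5, 2.0], [1.0, 0.5, 4.0]], f_prime=0.05))
```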

Cameras With Lenses

Disadvantage of pinhole principle: Not enough light gathered from the scene (only single ray per
image point!) ⇒ Use lens to gather more light from scene and keep image in focus

Note: As real pinholes have finite size the image plane is illuminated by a cone of light rays: The
larger the hole, the wider the cone ⇒ image gets more and more blurred.

Behaviour of lenses is defined mainly by geometric optics (ignoring physical effects of e.g. interference,
diffraction, etc.):

• In homogeneous media light (rays) travels in straight lines.

• When light rays are reflected from a surface, the incident ray, the (specular) reflection, and the
  surface normal are coplanar; the angles between the normal and the two rays are equal.

• When passing from one medium (index of refraction n₁) to another (index n₂) light rays are
  refracted (i.e. change direction); the original ray, the refracted one, and the normal to the
  interface are coplanar; the change of direction is related to the indexes of refraction according to:

$$n_1 \sin(\alpha_1) = n_2 \sin(\alpha_2)$$

(Sketch: incident ray at angle α₁ to the interface normal, specular reflection at α₁, refracted ray at angle α₂ in the second medium.)

⇒ Consider refraction and ignore reflection (i.e. we won't consider optical systems which include
mirrors as, e.g., telescopes)

Assumptions:

• Angles between light rays passing through a lens and the interface normal (normal to the re-
fractive surface) are small.

• Lenses are rotationally symmetric around the optical axis.

• Lenses have a circular boundary.

⇒ Paraxial geometric optics


Slide: Illustration of paraxial refraction (Fig. 19)

• For $\alpha_i$, $\beta_i$ and $\gamma$ the following hold:

$$\alpha_1 = \gamma + \beta_1 \qquad \alpha_2 = \gamma - \beta_2$$

• As small angles are approximately equal to their sines or tangents:

$$\sin\gamma \cdot R = h \;\Rightarrow\; \gamma \approx \frac{h}{R}$$

(Analogously the following can be derived: $\tan\beta_i \cdot d_i = h \;\Rightarrow\; \beta_i \approx \frac{h}{d_i}$ for $i = 1, 2$)

• Inserting into the law of refraction (assuming small angles) yields:

$$n_1 \alpha_1 \approx n_2 \alpha_2 \quad\Leftrightarrow\quad \frac{n_1}{d_1} + \frac{n_2}{d_2} = \frac{n_2 - n_1}{R}$$

Note: Relationship between d1 and d2 depends only on R and the indexes of refraction n1 and n2 but
not on βi .

Assumptions for (Thin) Lenses:

• Lens has two spherical surfaces (both with radius R).

• Lens has index of refraction n; the surrounding medium has index 1 (vacuum or approximately air).

• Lens is thin, i.e. a ray entering the lens is refracted at the front interface and immediately again at the back
  interface.

Slide: Illustration of refraction with a thin lens (Fig. 20)


Remarks:

• Scene point P is projected (in focus!) onto image point P ′ .

• Light ray (P O) through center of lens is not refracted.

• F (and F ′ ) is focal point of the lens at distance f from O.

• The relations between object distance z, image distance z′, object/image size y, y′, and the focal
  length f of the lens are defined by the thin lens equation:

$$\frac{1}{z'} - \frac{1}{z} = \frac{1}{f} \qquad\text{where}\quad f = \frac{R}{2(n-1)}$$

also:

$$\frac{y}{y'} = \frac{z}{z'}$$
• If z approaches infinity, image points are in focus at distance f .

Note: In practice objects within some limited range of distances (the so-called depth of focus) are in
acceptable focus.
Outside the focal plane images of scene points are projected onto a circle of finite size (i.e.
blurred).
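As a small illustration of the thin lens equation (a sketch under the sign convention used above, where scene points have z < 0; not part of the original notes):

```python
# Solve the thin lens equation 1/z' - 1/z = 1/f for the image distance z'.
def image_distance(z, f):
    return 1.0 / (1.0 / f + 1.0 / z)

# Example (hypothetical numbers): object 2 m in front of a lens with f = 50 mm.
print(image_distance(z=-2.0, f=0.05))   # ~0.0513 m, i.e. slightly beyond the focal point
# For z -> -infinity the image distance approaches f, as stated above.
```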

Real Lenses suffer from a number of aberrations:

• Spherical aberration:

Light rays from P form a circle of confusion in the image plane. (Circle with minimum diam-
eter = circle of least confusion, in general not located in P ′ )

• Distortion: pincushion and barrel distortion.

• Chromatic aberration resulting from the index of refraction being dependent on wavelength.
Can be shown using a prism.

Real optical systems use multiple lenses:

• Multi-lens arrangements can be used to minimize aberrations.

• However, multi-lens systems suffer from vignetting effects as light rays are blocked by the aper-
  tures of different lenses.

2.2 Structure of the Human Eye cf. [For03, Sec. 1.3], [Gon02, Sec. 2.1]

In general, the human eye is an optical imaging system (i.e. a “camera”), which projects a scaled
down, inverted image of the scene onto the background of the eye (= retina).
Slide: The human eye as an imaging system (Fig. 21)

Slide: Schematic structure of the human eye (Fig. 22)


Remarks:

• The cornea serves as a protective shield.

• The iris realizes the aperture; it can be adjusted in size by the ciliary muscle (synchronously for
  both eyes).

• The lens is flexible and able to adjust its refractive power via the tension of the ciliary fibers.

– Far objects: Lens is relatively flat (from 3 m and above)


– Near objects: Lens is rather thick.

• The retina contains 2 types of photoreceptors:

– Rods: very sensitive to light (approx. 120M)


– Cones: sensitive to color in 3 sub-types (roughly to “red”, “green”, and “blue”; approx.
6M)

• Cones mostly found in the area of the fovea


Slide: Distribution of rods and cones on the retina (Fig. 23)

• Muscles rotate the eye such that objects of interest are projected onto the fovea.

• The blind spot, i.e. the path of the optical nerve, contains no photoreceptors (Note: the nerve
  connections of the photoreceptors lie in front of the retina).

2.3 Color cf. [Gon02, Sec. 6.1, 6.2]

2.3.1 Human Color Perception

... is based on 3 types of color sensitive photoreceptors in the retina (cones), which respond differently
to the wavelengths of the visible spectrum.

Slide: Response of cones to different wavelengths (Fig. 24)


Remarks:

• Excitation of receptors is maximum for “red”, “green”, or “blue” light, respectively. However,
cones respond to complete visible spectrum!

• Intensity of response (≙ sensitivity of receptors) is highest for “green” light and decreases
  towards the boundaries of the visible spectrum.

• Homogeneous light of a single wavelength creates “pure” color sensation, which, however, can
also be produced by combination of appropriate wavelengths.

Color perception of the human eye can be represented by the CIE-diagram based on the following
definitions:
Let X, Y , Z be the absolute responses of the color sensors
Let the total intensity be: X + Y + Z = I
The relative color portions are then given by:

$$x = \frac{X}{I} = \frac{X}{X+Y+Z} \qquad y = \frac{Y}{I} \qquad z = \frac{Z}{I}$$

⇒ x + y + z = 1, or z = 1 − x − y (i.e. only 2 independent quantities)
⇒ colors can be represented in the x/y-plane

Slide: Chromaticity diagram (Fig. 25) (Note: Similar graphic also in [Gon02, Chap. 6])
Remarks:

• Pure spectral colors are found on the perimeter of the “tongue-shaped” color space. Those
colors have maximum saturation.

• Colors in the inner area result from a mixture of pure colors.

• Mixing two color components can potentially create all colors lying on the straight line between
  the two points in the diagram (mixing of 3 colors ⇒ triangle).

2.3.2 Color Models

The RGB Color Model

... motivated by primary sensitivity of human color receptors for (roughly) red, green, and blue light
⇒ RGB
Required: Specification of “base colors” for red, green, and blue (different possibilities!)
For maximisation of the colors that can be represented ([Bal82, p. 32]):
λ₁ = 410 nm (blue)
λ₂ = 530 nm (green)
λ₃ = 650 nm (red)

By superposition of those wavelengths most but not all colors can be generated (≙ triangle in the CIE
diagram).

Note: Mixing of colors (i.e. superposition) in “wave length space” is additive as opposed to mixing
of color pigments for printing (subtractive color mixing).

When normalizing the maximum intensity in the color channels to 1 (the minimum is 0, as no “negative”
frequency components are possible) one obtains the RGB color cube.

Drawback: Color information can not be separated from intensity information

⇒ alternative color models of the form: intensity + “color”

The HSI Color Model

H = Hue (“color tone”), roughly proportional to the average wavelength of a mixture of primary
    colors

S = Saturation, represents the “missing” of a white component within a mixture of colors

I = Intensity

The HSI model defines HS color planes (either of circular or hexagonal shape) perpendicular to the
intensity axis of the model ⇒ H, S, I are usually given in cylindrical coordinates.
Slide: HSI color space (Fig. 26)

The YUV Color Model cf. Dictionary by LaborLawTalk

... used for PAL television and, slightly modified (e.g. YCbCr, with Cb, Cr being scaled versions of
U, V), in computer component video.

Y = Luminance, i.e. the brightness component

U, V = Chrominance, i.e. “color” components

YUV can be computed from (normalized) RGB as follows:

Y = +0.299R + 0.587G + 0.114B


U = +0.492(B − Y )
V = +0.877(R − Y )

Remarks:

• Luminance Y is weighted sum of R, G, B intensities with dominant weight for “green”.

• U and V are (appropriately scaled) differences to the original B (“blue”) and R (“red”) components.
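A minimal Python sketch of this RGB-to-YUV conversion (assuming normalized RGB values in [0, 1]; not part of the original notes):

```python
import numpy as np

def rgb_to_yuv(rgb):
    # Works for a single pixel [r, g, b] or an image of shape (..., 3).
    r, g, b = np.moveaxis(np.asarray(rgb, dtype=float), -1, 0)
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luminance
    u = 0.492 * (b - y)                     # chrominance ("blue" difference)
    v = 0.877 * (r - y)                     # chrominance ("red" difference)
    return np.stack([y, u, v], axis=-1)

print(rgb_to_yuv([1.0, 1.0, 1.0]))   # pure white: Y = 1, U = V = 0
```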

Subtractive Color Mixing: The CMY[K] Model

... is used for producing colors in print, i.e. by using color pigments and not illumination sources ⇒
“subtractive” color mixing
Color pigments absorb certain spectral components of the incident light and reflect only the remaining
wavelength components (e.g. a “red” pigment absorbs the “green” and “blue” components and reflects only
the “red” part of the spectrum).
Primary colors for “subtractive” color mixing (better: mixing of coloring substances [i.e. pigments]):
CMY model

C = cyan (i.e. bluish green) ≙ W − R

M = magenta (≈ purple) ≙ W − G

Y = yellow ≙ W − B

Primary colors “subtract” one of the primary colors of the additive RGB model from white light (W ).
Mixtures can, therefore, be calculated with respect to the RGB model:

$$C \oplus M = (W - R) + (W - G) = \underbrace{W + W}_{=\,W} - R - G = W - R - G = (R + G + B) - R - G = B$$

Note: For the purpose of producing printed documents the CMY model is often realized as CMYK
with an additional K for black pigment, as C ⊕ M ⊕ Y usually does not produce acceptable
black for e.g. typesetting text.

Slide: Additive vs. subtractive color mixing (Fig. 27)

2.4 Digitization

... comprises two processing steps, namely sampling of an image (in general: some analog signal) in
the spatial (or time) domain and quantization of the samples (≙ measurements).

2.4.1 Sampling

(for images) ... is the measurement of the “content” of an analog image (e.g. the intensity) at the
discrete points of a 2-dimensional grid.
The topology of the grid is defined by the sensor arrangements of the imaging device used (Sect. 2.5).

Most common topology: regular 2-dim array.

Problem: The process of sampling can already lead to loss of information!

Example: A periodic signal is sampled at 10/9 of its wavelength ⇒ results in an aliased wavelength 10
     times longer than the original!

(after [Jäh02, Chap. 2])

Theoretical Result: The so-called sampling theorem states that sampling a continuous signal can
be achieved without loss of information, if sampling frequency is at least twice as high as the
highest frequency component present in the signal.

2.4.2 Quantization

... is the mapping of analog/continuous samples/measurements onto a discrete (especially finite) set
of values.
$$[f_{min}, f_{max}] \xrightarrow{\;Q\;} \{b_0, b_1, b_2, \ldots, b_{L-1}\}$$

Quantisation of continuous samples (after [Nie03, p. 68])

Note: Quantization necessarily introduces errors into the digitization process ⇒ exact reconstruction
of original signal no longer possible.

Remarks:

• Stored are not discrete quantities bi but their indices i.

• Behaviour of quantisation process is completely specified by defining the quantized values to


be used and the associated intervals on the continuous side that will be mapped upon them.

Note: The characteristic curve of the quantization need not be linear (nonlinear characteristic
useful for, e.g., quantization of X-ray images due to nonlinearity of absorption by tissue).

• For digital processing $L = 2^B$ quantization steps are appropriate.

  – For grey level images B = 8 bits (i.e. $L = 2^8 = 256$ discrete intensities) yield good
    subjective reproduction of the image.
  – For color images: B = 8 bits per color plane (R, G, B).
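A minimal Python sketch of uniform quantization onto L = 2^B discrete levels (illustrative only; the names and the clipping at the interval borders are assumptions):

```python
import numpy as np

def quantize(samples, f_min, f_max, bits=8):
    # Map continuous samples in [f_min, f_max] onto indices 0 .. 2**bits - 1.
    levels = 2 ** bits
    samples = np.clip(np.asarray(samples, dtype=float), f_min, f_max)
    idx = np.floor((samples - f_min) / (f_max - f_min) * levels)
    return np.clip(idx, 0, levels - 1).astype(int)

print(quantize([0.0, 0.4999, 0.5, 1.0], f_min=0.0, f_max=1.0))   # [  0 127 128 255]
```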

2.4.3 Representation of Digital Images

A grid point (i.e. measurement point) in a digital image is called pixel (or pel = picture element). Its
position in the M × N image array is specified by the row and column index x = 0, 1, ...M − 1 and
y = 0, 1, 2, ...N − 1 (Beware: Indices do not correspond to actual spatial positions!).

Note: When representing digital images mostly the so-called upper-left coordinate convention is
used, i.e. the origin of the image matrix lies at the upper left corner of the image.

Slide: Coordinate convention used with digital images (Fig. 28)

Note: The upper-left convention does not define a right-handed coordinate system.
⇒ the computation of angles is affected!

For segmentation (e.g. separating [foreground] objects from background) the definition of pixel
neighborhood on the image matrix/grid is important. Definition of connectivity is affected by this
choice.
Slide: Different definitions of pixel neighborhoods and effect on connectivity (Fig. 29)

Note: The problem with neighborhood variants and the resulting connectivity can be resolved using a hexag-
onal image/sampling grid (found in some digital cameras).

2.5 Imaging Devices

Most widely used type of digital camera: CCD camera (CCD = charge coupled device).

• A CCD sensor uses a rectangular grid of photosensitive elements (of some finite size).

• Incident light (photons) causes an electrical charge to build up in each element

• Charges of individual sensor elements are “read out” – i.e. moved out of the sensor array – by
using charge coupling (a row at a time).

Slide: Structure of a CCD device (Fig. 30)


Remarks:

• In order to build up charge in sensor elements some time of exposure to incident light has to be
allowed.

• E.g. for video applications image contents are read at some fixed rate (25 Hz for PAL, actually
reading odd and even rows separately at a rate of 50 Hz ⇒ motion can cause distortions).

• Capturing “color” requires sensors to be made sensitive to different parts of the electromagnetic
  spectrum:

  – Using color filters in front of a single CCD chip and reading R, G, B images sequentially

  – Using a beam splitter and color filters with 3 different CCD chips for R, G, B

  – Using a pattern (mosaic) of sensor elements sensitive to R, G, B, e.g. the Bayer pattern
    of color filters (alternating rows of G/R and B/G filter elements) [invented by Dr. Bryce
    E. Bayer of Eastman Kodak].

Chapter 3

Preprocessing

Goal: “Preparation” of images such that better results are achieved in subsequent processing steps
(e.g. segmentation). Preprocessed images are better suited for the processing that follows.
⇒ image enhancement

Note: There is no such thing as the preprocessing operation! (Techniques are highly application de-
pendent)

General Principle/Idea: Reduce unwanted variability in images/data

No in-depth treatment of preprocessing techniques! Only principle idea and typical example methods!

3.1 Normalization

Problem: Images/objects in images usually have parameters that vary within certain intervals (e.g.
size, position, intensity, ...). However, results of image analysis should be independent of this
variation.

Goal: Transform images such that parameters are mapped onto normalized values (or some appro-
priate approximation).

3.1.1 Intensity

Interpretation of image content is usually independent of (local) image intensity ⇒ normalization!


Slide: Basic image types: dark, light, low contrast, high contrast (Fig. 31)

• Normalization to a standard interval [0, a], e.g. [0, 255]:
  Transform the original grey value $f_{ij}$ into the normalized value $h_{ij}$ according to:

$$h_{ij} = \frac{a\,(f_{ij} - f_{min})}{f_{max} - f_{min}} \qquad\text{where}\quad f_{max} = \max_{ij} f_{ij},\; f_{min}\text{ analogously}$$

• Normalization to zero mean and unit variance:

  Let µ be the mean image intensity and σ² the associated variance:

$$\mu = \frac{1}{\sum_{ij} 1} \sum_{ij} f_{ij} \qquad \sigma^2 = \frac{1}{\sum_{ij} 1} \sum_{ij} (f_{ij} - \mu)^2$$

  Normalized intensities $h_{ij}$ will have µ′ = 0 and σ′² = 1:

$$h_{ij} = \frac{f_{ij} - \mu}{\sigma}$$

Remark: Resulting intensities are no longer integers!

Note: Global normalization of intensity usually simpler but less effective than more complicated
local normalization!
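The two global normalizations above can be sketched in a few lines of Python (not from the notes; illustrative only):

```python
import numpy as np

def normalize_to_range(f, a=255.0):
    # h_ij = a * (f_ij - f_min) / (f_max - f_min)
    f = np.asarray(f, dtype=float)
    return a * (f - f.min()) / (f.max() - f.min())

def normalize_zero_mean_unit_variance(f):
    # h_ij = (f_ij - mu) / sigma; results are no longer integers
    f = np.asarray(f, dtype=float)
    return (f - f.mean()) / f.std()

img = np.array([[10, 20], [30, 90]])
print(normalize_to_range(img))
print(normalize_zero_mean_unit_variance(img))
```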

3.1.2 Histogram Equalization

... the most “popular” normalization method, based on the grey level histogram of an image given by:

$$h(g) = \#\text{ of grey values } g \text{ in the image} = \sum_{ij} \delta(f_{ij} - g) \qquad\text{where}\quad \delta(x) = \begin{cases} 1 & \text{if } x = 0 \\ 0 & \text{otherwise} \end{cases}$$

From the histogram the (estimated) probability of a pixel having a certain grey level g can be derived:

$$p(g) = \frac{1}{\sum_{ij} 1}\, h(g) = \frac{1}{MN}\, h(g) \qquad\text{for } M \times N \text{ images}$$

Goal: Use complete dynamic range of available grey levels within an image in order to resolve small
but frequent differences better.

Idea: Grey level intervals of high density in the histogram should be “stretched” and those with low
density should be “compressed”.
⇒ In general for the discrete case the number of grey levels is reduced by “compression”.

Method: Let the cumulative distribution function of the grey values be:

$$H(f) = \sum_{g=0}^{f} p(g) = \sum_{g=0}^{f} \frac{1}{MN}\, h(g)$$

(directly obtainable from the grey level histogram)

The transform defined by

$$T(f) = \lfloor L \cdot H(f) \rfloor = \left\lfloor \frac{L}{MN} \sum_{g=0}^{f} h(g) \right\rfloor$$

(where L is the number of grey levels available and ⌊x⌋ denotes the largest integer that is less than or
equal to x) achieves an approximate equalization of the grey level histogram, i.e. the histogram of the
transformed grey levels is approximately equal to a uniform distribution.

Note: In the discrete case the equalized histogram will in general not be equal to a uniform distribu-
     tion. This can be achieved for the continuous case only!

Note: To some extent also a normalization of intensity is achieved implicitly.

Slide: Examples of histogram equalization (Fig. 32)
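A minimal Python sketch of the transform T(f) = ⌊L · H(f)⌋ for an 8-bit image (not part of the notes; the clipping of the topmost level is an implementation detail):

```python
import numpy as np

def equalize_histogram(img, levels=256):
    img = np.asarray(img, dtype=int)
    hist = np.bincount(img.ravel(), minlength=levels)                 # h(g)
    cdf = np.cumsum(hist) / img.size                                  # H(f)
    lut = np.clip(np.floor(levels * cdf), 0, levels - 1).astype(int)  # T(f)
    return lut[img]                                                   # apply as a lookup table

img = np.array([[52, 55, 61], [59, 79, 61], [85, 170, 240]])
print(equalize_histogram(img))
```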

3.1.3 Some Other Techniques

Size, Position, Orientation, ...


Slide: Example of position/orientation normalization (Fig. 33)

Slide: Example normalisation of character slant (Fig. 34)

3.2 Filtering

Image filtering comprises image transforms that work on a certain neighborhood of pixels (usually a
rectangular or square area centered at the pixel in question) for deriving a new grey value.
Simple formalization of filtering:

• Move a window (which has the size of the neighborhood considered) from point to point in the
image

• Calculate the response of the filter at each point by applying some operation to the pixel grey
values within the window.

⇒ Filters are implemented using local image operations, i.e. they realize locally defined image transforms!

3.2.1 Linear Filters and Convolution

An especially relevant case of filters are linear (the transform satisfies T{af + bg} = aT{f} + bT{g}),
shift-invariant (the transform is independent of pixel position) filters.
Linear filters can be realized as follows:

• Define a filter mask with the size of the neighborhood and filter coefficients/weights w(s, t)
assigned to each point of the mask.
Slide: Principle of filtering using masks (Fig. 35)

• Move the mask over the image from point to point

• Calculate the filter response by a weighted sum of pixel grey values and mask coefficients.
For a 3 × 3 mask the filter response is given by:

g(x, y) = w(−1, −1)f (x−1, y −1)+w(−1, 0)f (x−1, y )+...+w(0, 0)f (x, y)+...w(1, 1)f (x+1, y+1)

(Note: Coefficient w(0, 0) coincides with f (x, y), i.e. mask is centered at position (x, y))

In general the result of applying a linear filter w() of size m × n (with m = 2a + 1 and n = 2b + 1)
to an image f() of N × M pixels is given by:

$$g(x, y) = \sum_{s=-a}^{a} \sum_{t=-b}^{b} w(s, t)\, f(x + s, y + t)$$

Note: This formulation is equivalent to computing the cross-correlation between the mask and the
     image, which is similar to the concept of convolution:

$$h = f * g \;\Leftrightarrow\; h(x, y) = \sum_{s} \sum_{t} f(s, t)\, g(x - s, y - t) = \sum_{s} \sum_{t} f(x - s, y - t)\, g(s, t)$$

     For convolution either the signal f() or the mask w() needs to be mirrored along the x- and
     y-axis, which is unproblematic as masks are frequently symmetric.

Example: Image Smoothing with Linear Filters

• Image Averaging
  Use an m × n window with weights $w(s, t) = \frac{1}{mn}$

  Note: Undesirable effects on the image content in the frequency domain (see Sect. 3.2.2)!

• Gaussian Smoothing
  Required: Filter coefficients for a discrete 2D approximation of the Gaussian

  The discrete 1D Gaussian can be obtained from the rows of Pascal's Triangle:

  1
  1 1
  1 2 1
  1 3 3 1
  1 4 6 4 1

  The 3 × 3 Gaussian filter mask is obtained from the outer product:

$$\begin{pmatrix} 1 \\ 2 \\ 1 \end{pmatrix} \times \begin{pmatrix} 1 & 2 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{pmatrix}$$

Note: In order not to amplify the image content, masks are usually normalized by the sum of their
coefficients (for the 3 × 3 Gaussian: 1/16).
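A small Python sketch applying the normalized 3 × 3 Gaussian mask by the weighted-sum formula given above (not from the notes; border pixels are simply left unchanged):

```python
import numpy as np

gauss_3x3 = np.array([[1, 2, 1],
                      [2, 4, 2],
                      [1, 2, 1]], dtype=float) / 16.0

def filter_image(f, w=gauss_3x3):
    # g(x, y) = sum_s sum_t w(s, t) f(x+s, y+t); since the mask is symmetric,
    # cross-correlation and convolution give the same result here.
    f = np.asarray(f, dtype=float)
    a, b = w.shape[0] // 2, w.shape[1] // 2
    g = f.copy()
    for x in range(a, f.shape[0] - a):
        for y in range(b, f.shape[1] - b):
            g[x, y] = np.sum(w * f[x - a:x + a + 1, y - b:y + b + 1])
    return g

print(filter_image(np.eye(5) * 16.0))
```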

3.2.2 Filtering in the Frequency Domain: Overview

Motivations:

• Better understanding of effects of filtering on images (only with more mathematical details!)

• Increased efficiency for computation of responses for large linear filters

Basis: Discrete Fourier Transform (DFT)


$$F(u, v) = \frac{1}{MN} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)\, e^{-i 2\pi \left( \frac{ux}{M} + \frac{vy}{N} \right)}$$

• Computes a frequency representation of a signal/image (Note: Both f and its transform F are in general
  complex!)

• Input signals are (roughly) approximated by sine and cosine functions of different frequencies
  and amplitudes (Note: $e^{ix} = \cos x + i \sin x$).

• Assumes discrete periodic input ⇒ finite images are treated as if repeated periodically!

Slide: Examples of 1D-DFT (Fig. 36)

Slide: Example of 2D-DFT (Fig. 37)

Note: The example shows that the averaging filter (rectangular spatial response) is not a good smoothing
     operation, as its effects are infinite in the frequency domain!
     ⇒ Better solution: Gaussian, as it is form-invariant under the Fourier transform!
Important property of the Fourier transform:
Convolution in the spatial domain is converted to multiplication in the frequency domain and vice
versa:
f = g∗h⇔F = G·H
⇒ Convolution operations can be computed more efficiently in frequency domain! (Note: For fil-
tering additionally computation of forward and backward transform necessary ⇒ beneficial only for
large filter masks!)
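A short Python sketch of the convolution theorem F = G · H using the FFT (not from the notes; note that multiplying DFTs corresponds to circular convolution of the finite, periodically repeated image):

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.random((8, 8))              # "image"
h = np.zeros((8, 8))
h[:3, :3] = 1.0 / 9.0               # 3x3 averaging mask, zero-padded to image size

# Filtering in the frequency domain: transform, multiply, transform back.
g = np.real(np.fft.ifft2(np.fft.fft2(f) * np.fft.fft2(h)))
print(g.shape)                      # (8, 8): circularly filtered image
```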

3.2.3 Rank-Order Filters

(also known as order-statistics filters)

... in general non-linear spatial filters whose response is based on an ordering (ranking) of intensity
values within the neighborhood considered.
Given a neighborhood V (x, y) = {f (x + s, y + t)| − a ≤ s ≤ a ∧ −b ≤ t ≤ b}
Order pixel values (i.e. intensities) as follows:

R(x, y) = {r1 ≤ r2 ≤ ... ≤ rK |rk ∈ V (x, y) ∧ K = |V (x, y)|}

A rank-order (filter) operation is then defined as a function of R(x, y):

h(x, y ) = φ{R(x, y )}

Simple Examples:

For the image patch

0 2 4 6
1 4 5 6
1 3 6 5

with a 3 × 3 neighborhood V(x, y) covering the left three columns:

R(x, y) = {0, 1, 1, 2, 3, 4, 4, 5, 6}, median $r_{(K+1)/2} = 3$
R(x + 1, y) = {2, 3, 4, 4, 5, 5, 6, 6, 6}

For the neighborhood

0 0 1
0 1 1
0 1 1

R(x, y) = {0, 0, 0, 0, 1, 1, 1, 1, 1}, median $r_{(K+1)/2} = 1$
The following “well known” rank-order / order-statistics filters/operations can be defined:

$h(x, y) = r_1$  (Erosion)
$h(x, y) = r_K$  (Dilation)
$h(x, y) = r_{(K+1)/2}$  (Median)
$h(x, y) = r_K - r_1$  (Edge detection, aka “morphological edge”)
$h(x, y) = \begin{cases} r_1 & \text{if } f(x, y) - r_1 < r_K - f(x, y) \\ r_K & \text{otherwise} \end{cases}$  (Edge sharpening)

Slide: Example of erosion (combined with dilation) (Fig. 38)

Slide: Example of dilation (combined with erosion) (Fig. 39)

Remarks:

• Erosion + Dilation = Opening

• Dilation + Erosion = Closing

• The median as a smoothing operation preserves contrast/edges (in contrast to e.g. averaging) but
  removes “salt-and-pepper” noise (i.e. small “errors”).

• Computation of order statistics requires sorting of pixel intensities in a large number of neigh-
  borhoods ⇒ efficient implementations required!

Slides: Example images for averaging, median (Fig. 40), erosion/dilation, morphological edge
(Fig. 41), and opening/closing (Fig. 42)
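A minimal Python sketch of a rank-order filter (not from the notes): sort the intensities in each neighborhood and pick a rank, which directly gives erosion (r₁), dilation (r_K), the median, or the morphological edge:

```python
import numpy as np

def rank_filter(f, rank_fn, a=1, b=1):
    # Apply phi{R(x, y)} on each (2a+1) x (2b+1) neighborhood; borders are skipped.
    f = np.asarray(f, dtype=float)
    g = f.copy()
    for x in range(a, f.shape[0] - a):
        for y in range(b, f.shape[1] - b):
            r = np.sort(f[x - a:x + a + 1, y - b:y + b + 1].ravel())
            g[x, y] = rank_fn(r)
    return g

img = np.zeros((5, 5)); img[2, 2] = 255                 # single "salt" pixel
print(rank_filter(img, lambda r: r[len(r) // 2]))       # median removes the outlier
print(rank_filter(img, lambda r: r[-1] - r[0]))         # morphological edge
```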

Chapter 4

Local Image Features

... features (numerical values!) of individual pixels or pixel neighborhoods that are relevant for image
segmentation and/or interpretation.

In the simplest case: (smoothed) image intensity/color (i.e. without noise)


However, more complicated local features exist.

4.1 Edges cf. [Gon02, Sec. 10.1]

For segmentation boundaries between regions of images are especially relevant, which become man-
ifest by discontinuities in the image.

4.1.1 Detection of Local Discontinuities

Local discontinuities =
ˆ (usually) difference in image intensity/grey level
Remarks:

• Definition of “difference” more complicated for color images (i.e. multi-channel images)

• Grey level may be substituted by some local image feature.

Note: Usually pixels where relevant grey level differences can be observed are called edge pixels or
     edge elements (edgels), as opposed to contours, i.e. boundaries of regions.

Discrete Differentiation

Differentiation in the discrete case ≙ computing local differences (approximation of continuous dif-
ferentiation)

Slides: Examples of edge types (ideal, ramp) (Fig. 43), and behaviour of derivatives (Fig. 44)

⇒ Computing local differences (by some method) for every pixel yields a new image, i.e. an edge image!
2 commonly used methods:

• 1st (discrete) derivative ≙ gradient

• 2nd (discrete) derivative ≙ Laplacian

a) Gradient

   For 2D signals (e.g. images) f(x, y) the gradient

$$\vec{g}(x, y) = \begin{pmatrix} \frac{\partial f(x,y)}{\partial x} \\ \frac{\partial f(x,y)}{\partial y} \end{pmatrix} = \begin{pmatrix} f_x \\ f_y \end{pmatrix}$$

   represents the direction and the magnitude of a change in image intensity at position (x, y).
   The magnitude of the gradient can be computed as:

$$g_{mag} = |\vec{g}(x, y)| = \sqrt{f_x^2 + f_y^2}$$

   or can alternatively be approximated by:

$$g_{max} = |f_x| + |f_y|$$

   The direction of the gradient (perpendicular to the contour!) is given by:

$$g_{dir} = \arctan\frac{f_y}{f_x}$$

   Note: This assumes a Cartesian coordinate system and needs to be adapted for upper-left image coordi-
   nates!

   For discrete signals (e.g. images) the partial derivatives $f_x, f_y$ need to be approximated by local
   differences $\Delta_i f, \Delta_j f$. For computing those the following possibilities exist:

$$\Delta_i f = f_i = \begin{cases} f(i, j) - f(i-1, j) & \text{backward gradient} \\ f(i+1, j) - f(i, j) & \text{forward gradient} \\ f(i+1, j) - f(i-1, j) & \text{symmetric gradient} \end{cases}$$

   ($\Delta_j f$ defined analogously)

b) 2nd Derivative
... not grey-level difference but change in grey-level curvature is interpreted as an edge.

Potential advantage: “wide” edges can be suppressed if only fast transition from positive to
negative curvature is considered as an edge.

   Usually approximated for discrete signals (images) by:

$$\Delta^2_{ii} f = f_{ii} = f(i+1, j) - 2f(i, j) + f(i-1, j)$$

   For the computation of edge pixels usually the Laplacian is used:

$$\nabla^2 f(x, y) = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}$$

   which can be approximated in the discrete case by

$$\nabla^2 f = \Delta^2_{ii} f + \Delta^2_{jj} f = 4 f(i, j) - \bigl( f(i+1, j) + f(i-1, j) + f(i, j+1) + f(i, j-1) \bigr)$$

   Note: Depending on the definition used, the sign of the Laplacian may be inverted (as in the
   definition above, which follows the notation used in [Gon02])!

Problem: Applied in isolation both the gradient and the Laplacian are very sensitive to noise in
images!

Slide: Example of ramp edge corrupted by Gaussian noise with increasing variance showing
effect in 1st and 2nd derivatives (Fig. 45)

4.1.2 Edge Operators

... i.e. (convolution?!) masks for detecting local discontinuities

• For computation of the gradient (so-called Prewitt operator):

Remarks:

– Orientation of filter masks depends on orientation of coordinate axes (here: upper left!).
– Variants exist for detecting diagonal edges.

Slide: Example of gradient in x- and y-direction and combined magnitude (Fig. 46)

• Laplacian operator:

Remarks:

• Coefficients of edge operator masks sum to zero!

• Different sizes of operator mask may be used.

Note: As gradient/Laplacian are very sensitive to noise usually operators combining edge detection
and smoothing are applied.

• Sobel Operator

• Laplacian of a Gaussian (LoG)


Slide: 5 × 5 Laplacian of a Gaussian mask (LoG) (Fig. 47)

Slide: Example of Sobel operator, Laplacian and Laplacian smoothed with a Gaussian (LoG) (Fig. 48)
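As an illustration of a combined smoothing + differentiation edge operator, a minimal Python sketch using the standard Sobel masks (the masks are textbook knowledge, not reproduced from the slides; borders are skipped):

```python
import numpy as np

sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

def correlate3x3(f, w):
    f = np.asarray(f, dtype=float)
    g = np.zeros_like(f)
    for x in range(1, f.shape[0] - 1):
        for y in range(1, f.shape[1] - 1):
            g[x, y] = np.sum(w * f[x - 1:x + 2, y - 1:y + 2])
    return g

def sobel_magnitude(f):
    fx, fy = correlate3x3(f, sobel_x), correlate3x3(f, sobel_y)
    return np.sqrt(fx ** 2 + fy ** 2)    # g_mag = sqrt(f_x^2 + f_y^2)

img = np.zeros((6, 6)); img[:, 3:] = 255   # step edge
print(sobel_magnitude(img))
```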

4.2 Texture cf. [Gon02, Sec. 11.3.3], [For03, Chap. 9]

“Texture is a phenomenon that is widespread, easy to recognise and hard to define”

[For03, p. 189]

Slide: Examples of textures (Fig. 49)

Problem: Appearance of surfaces or boundaries between object surfaces can not be described by
characteristics of a single pixel (e.g. grey-level or color)!
⇒ (local) neighborhood of pixels needs to be considered

Problem: How can local characteristics of textured surfaces be represented?
⇒ Principle approaches:

• Filter-bank based representations (cf. [For03, Chap. 9.1])


• Spectral representations (cf. [Gon02, pp. 670–672])
• Statistical representations (cf. [Gon02, pp. 666–670])

4.2.1 Statistical Texture Representation: Co-occurrence Matrices

Goal: (Statistical) Description of relations between pixels’ grey values in a local neighborhood.

Assumption: Texture can be characterized by the frequencies with which two pixels with given grey
     values occur at some given (relative) distance and with given orientation.

Let $n_{i,j}(d, \alpha)$ be the number of times that pixels with grey levels i and j occur at distance d and
orientation α (sketch: a pixel with f(x, y) = i and a pixel with f(u, v) = j, separated by distance d at angle α).

Remarks:

• If the total number of grey levels in an image is L one obtains L × L co-occurrence counts
  $n_{i,j}(d, \alpha)$.

• For counting grey level co-occurrences a local neighborhood of given size must be specified
  (How to treat image boundaries?).

Definition:
The normalized grey-level co-occurrence matrix (GLCM) G(d, α) is defined as:

$$G(d, \alpha) = [g_{i,j}(d, \alpha)] = \left[ \frac{n_{i,j}(d, \alpha)}{\sum_{i=0}^{L-1} \sum_{j=0}^{L-1} n_{i,j}(d, \alpha)} \right]$$

Note: A single GLCM is not sufficient for describing a texture (only a single distance and orientation
parameter is used)!
⇒ Matrices need to be computed for several different distance and orientation parameters!

Example: For the 4 × 4 image neighborhood (grey levels 0–3)

0 0 3 3
0 0 3 3
2 2 1 1
2 2 1 1

one obtains (boundary treatment: truncation):

$$G(1, 0^\circ) = \frac{1}{12} \begin{pmatrix} 2 & 0 & 0 & 2 \\ 0 & 2 & 0 & 0 \\ 0 & 2 & 2 & 0 \\ 0 & 0 & 0 & 2 \end{pmatrix} \qquad G(2, 90^\circ) = \frac{1}{8} \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 4 \\ 4 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}$$
Problem: GLCMs considerably increase the parametric representation of textures (multiple matrices
     for different d and α are required)!
     ⇒ Calculate features derived from GLCMs

Possible features based on GLCMs:

a) Energy

$$e(d, \alpha) = \sum_{i=0}^{L-1} \sum_{j=0}^{L-1} g_{i,j}^2(d, \alpha)$$

   Defines a measure for the “regularity” of a texture, because:

   – in regular textures only few strong/prominent grey-level differences occur ⇒ only few
     large $g_{i,j}(d, \alpha)$ ⇒ summation of squared values large
   – otherwise many/all grey-level differences exist ⇒ many small $g_{i,j}(d, \alpha)$ ⇒ summation of
     squared values small

b) Contrast

$$c(d, \alpha) = \sum_{i=0}^{L-1} \sum_{j=0}^{L-1} (i - j)^2\, g_{i,j}(d, \alpha)$$

   Measures the magnitude of local changes in grey value:

   – homogeneous regions (i.e. with i = j) will be suppressed
   – large differences are weighted strongly: $(i - j)^2$

   Can be rewritten as:

$$c(d, \alpha) = \sum_{\Delta l = 0}^{L-1} \Delta l^2 \sum_{|i-j| = \Delta l} g_{i,j}(d, \alpha)$$

   with Δl being the local grey level difference.

c) Homogeneity

$$h(d, \alpha) = \sum_{i=0}^{L-1} \sum_{j=0}^{L-1} \frac{1}{1 + |i - j|}\, g_{i,j}(d, \alpha)$$

   Acts approximately inversely to the computation of contrast: homogeneous regions (i.e. i = j)
   are weighted strongly, other areas less strongly ($\frac{1}{1+z} < 1$ if $z > 0$).

   Can be rewritten as:

$$h(d, \alpha) = \sum_{\Delta l = 0}^{L-1} \frac{1}{1 + \Delta l} \sum_{|i-j| = \Delta l} g_{i,j}(d, \alpha)$$

   (with Δl being the local grey level difference).
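A compact Python sketch that computes a normalized GLCM for a given displacement and the three features defined above (not from the notes; the displacement (dr, dc) encodes distance and orientation, and truncation is used at the boundary):

```python
import numpy as np

def glcm(img, dr, dc, levels=4):
    img = np.asarray(img, dtype=int)
    n = np.zeros((levels, levels))
    rows, cols = img.shape
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:    # truncation at the boundary
                n[img[r, c], img[r2, c2]] += 1
    return n / n.sum()

def glcm_features(g):
    i, j = np.indices(g.shape)
    return {"energy": np.sum(g ** 2),
            "contrast": np.sum((i - j) ** 2 * g),
            "homogeneity": np.sum(g / (1 + np.abs(i - j)))}

img = np.array([[0, 0, 3, 3], [0, 0, 3, 3], [2, 2, 1, 1], [2, 2, 1, 1]])
print(glcm_features(glcm(img, dr=0, dc=1)))   # d = 1, orientation 0 degrees
```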

4.2.2 Spectral Texture Representation: Overview

Basis: The (Fourier) spectrum decomposes an image into periodic functions (sines and cosines) of different
     wavelength and amplitude.

Idea: Capture periodic grey-level patterns within textures by frequency-based representations.

Note: Similarly to GLCMs not the raw spectrum but features derived from it are used for represent-
ing/characterizing textures!

Examples of features of the spectrum useful for texture representation:

a) Prominent peaks ⇒ give the principal direction of the texture patterns

b) Location of prominent peaks in the frequency plane ⇒ represents fundamental (spatial) period
of texture patterns

Extracting such features can be simplified by expressing the spectrum S(u, v) in polar coordinates as
a function S(r, θ).
For each radius r or angle θ the spectrum may be considered as a 1D function Sr (θ) or Sθ (r), respec-
tively.
More global representations can be obtained by integration (actually summation in the discrete case)
of those functions:

$$S(r) = \sum_{\theta=0}^{\pi} S_\theta(r) \qquad\text{and}\qquad S(\theta) = \sum_{r=1}^{R_0} S_r(\theta)$$

(where R₀ is the maximal radius of a circle centered at the origin; because of symmetry only angles between
0 and π need to be considered)
Slide: Periodic textures and derived frequency-based representations (Fig. 50)

4.3 Motion cf. [Hor81]

When an object within a scene moves relative to the position of the camera (or when the camera
moves relative to the scene) the 2-dimensional projection onto the image plane f (x, y) is additionally
dependent on the time t, yielding f (x, y, t).
Motion in the image plane (!) can be estimated by finding corresponding pixels in subsequent images
(≙ displacement wrt. the x/y-direction).

Note: Motion in 3D can only be estimated if, additionally, depth information is available!

Potential Methods:

• Optical flow: implicit motion estimation

• Techniques based on matching: explicit motion estimation, either of image patches (≙
  block-matching) or of so-called keypoints (cf. Section 5.4)

4.3.1 Optical Flow

Idea: Infer information about motion within a scene from changes in grey-level structure.

Assumptions: During motion ...

• ... the orientation of surfaces relative to illumination sources is invariant (or the illumina-
tion is uniform) ...
• ... and the orientation relative to the observer is constant (or the grey-level is independent
of a change in orientation).

Sketch of the principal situation: a grey-level patch at position (x, y) in the image at time t, i.e. f(x, y, t),
appears displaced by (dx, dy) in the image at time t + dt, i.e. at f(x + dx, y + dy, t + dt).

On the level of grey values the following then holds:

$$f(x, y, t) = f(x + dx, y + dy, t + dt)$$

An expansion of the expression on the right-hand side into a Taylor series at the point (x, y, t) yields:

$$f(x, y, t) = f(x, y, t) + \frac{\partial f}{\partial x}\, dx + \frac{\partial f}{\partial y}\, dy + \frac{\partial f}{\partial t}\, dt + \{\text{residual}\}$$

When ignoring terms of higher order within the Taylor expansion (i.e. the “residual”) one obtains
(with $f_x = \frac{\partial f}{\partial x}$ etc.):

$$f_x\, dx + f_y\, dy + f_t\, dt = 0$$

With the definition of the velocities in x/y-direction, $u = \frac{dx}{dt}$, $v = \frac{dy}{dt}$, this results in the so-called motion
constraint equation:

$$E_m = f_x u + f_y v + f_t = 0$$

4.3.2 Estimation of Optical Flow

Problem: Optical flow (i.e. velocities u and v) is not uniquely defined by the motion constraint
equation Em .
However, assuming that every point in the image could move independently is not realistic.

Constraint: Usually opaque objects of finite size undergoing rigid motion are observed ⇒ neighbor-
ing points in the image will (most likely) lie on the same object (surface) and, therefore, have
similar velocities.
⇒ Optical flow is required to be smooth

This smoothness constraint can be expressed by minimizing (the square of) the magnitude of the
gradient of the flow velocities:

$$E_s^2 = \left(\frac{\partial u}{\partial x}\right)^2 + \left(\frac{\partial u}{\partial y}\right)^2 + \left(\frac{\partial v}{\partial x}\right)^2 + \left(\frac{\partial v}{\partial y}\right)^2 \;\rightarrow\; \min!$$

The motion constraint equation and the smoothness constraint should hold for all positions in the
image. Due to violations of the assumptions and noise in the image this will, however, never be exactly
the case in practice.
Therefore, a combined constraint can be formulated as minimizing the following total error:

$$E = \iint \left( E_m^2 + \alpha^2 E_s^2 \right) dx\, dy$$

where α² is a suitable weighting factor.

Using variational calculus one can show that a necessary condition for an extremum (minimum) is
given by:

$$f_x^2 u + f_x f_y v = \alpha^2 \nabla^2 u - f_x f_t$$
$$f_x f_y u + f_y^2 v = \alpha^2 \nabla^2 v - f_y f_t$$

Note: The Laplacian $\nabla^2 u$ (or $\nabla^2 v$) can be approximated in the discrete case as:

$$\nabla^2 u = \beta\,(\bar{u}_{i,j,k} - u_{i,j,k})$$

with $\bar{u}_{i,j,k}$ being a suitable average over some local neighborhood, e.g. the 4-neighborhood:

$$\bar{u}_{i,j,k} = \tfrac{1}{4}\,(u_{i-1,j,k} + u_{i,j-1,k} + u_{i+1,j,k} + u_{i,j+1,k})$$

The proportionality factor β then needs to be set to 4.

The constraint equations can then be rewritten as:

$$(\alpha^2\beta + f_x^2)\, u + f_x f_y\, v = \alpha^2\beta\, \bar{u} - f_x f_t$$
$$f_x f_y\, u + (\alpha^2\beta + f_y^2)\, v = \alpha^2\beta\, \bar{v} - f_y f_t$$

Further rearranging of terms yields (after some lengthy derivations) the following form, which defines
u, v based on the gradients $f_x, f_y, f_t$ and the local averages $\bar{u}, \bar{v}$:

$$u = \bar{u} - \frac{f_x\,(f_x \bar{u} + f_y \bar{v} + f_t)}{\alpha^2\beta + f_x^2 + f_y^2} \qquad v = \bar{v} - \frac{f_y\,(f_x \bar{u} + f_y \bar{v} + f_t)}{\alpha^2\beta + f_x^2 + f_y^2}$$

Remarks:

• The choice of α² plays a significant role only in areas with small gradient.

• The structure of the equations can be used to define an iterative procedure for computing estimates of
  the flow velocities, i.e. by calculating new estimates $(u^{n+1}, v^{n+1})$ from the estimated derivatives
  and local averages of the previous velocity estimates:

$$u^{n+1} = \bar{u}^n - \frac{f_x\,(f_x \bar{u}^n + f_y \bar{v}^n + f_t)}{\alpha^2\beta + f_x^2 + f_y^2} \qquad v^{n+1} = \bar{v}^n - \frac{f_y\,(f_x \bar{u}^n + f_y \bar{v}^n + f_t)}{\alpha^2\beta + f_x^2 + f_y^2}$$

  (The procedure can be initialized by assuming a uniformly vanishing flow field, i.e. velocities u =
  v = 0.)

Slide: Example of optical flow computation (Fig. 51)
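The iteration above can be sketched in a few lines of Python (an assumption-laden illustration, not the notes' reference implementation: simple finite differences for f_x, f_y, f_t, wrap-around 4-neighborhood averages, and a single parameter standing in for α²β):

```python
import numpy as np

def optical_flow(f0, f1, alpha2beta=100.0, n_iter=100):
    f0, f1 = np.asarray(f0, float), np.asarray(f1, float)
    fx = np.gradient(f0, axis=1)          # approximations of the partial derivatives
    fy = np.gradient(f0, axis=0)
    ft = f1 - f0
    u = np.zeros_like(f0)                 # initialize with a vanishing flow field
    v = np.zeros_like(f0)
    avg = lambda a: 0.25 * (np.roll(a, 1, 0) + np.roll(a, -1, 0) +
                            np.roll(a, 1, 1) + np.roll(a, -1, 1))
    for _ in range(n_iter):
        ub, vb = avg(u), avg(v)
        common = (fx * ub + fy * vb + ft) / (alpha2beta + fx ** 2 + fy ** 2)
        u = ub - fx * common              # u^{n+1}
        v = vb - fy * common              # v^{n+1}
    return u, v
```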

4.4 Depth cf. also [For03, Chap. 11]

Problem: When mapping a 3D scene onto a 2D image (plane) information about depth (i.e. distance
of a scene point from the image plane) is lost!

Potential Methods for obtaining depth information:

• Measuring depth, e.g. by so-called laser range finders


• Use of structured light, i.e. superimposing a light pattern of known geometry onto the
scene
• Using multiple views of the scene (e.g. two images ⇒ stereo vision) to estimate depth

(Note: The first two methods are active, the third one passive.)

4.4.1 Stereo Vision

Principle: Two (or more) images/views of a scene are captured from different positions/view-points.
     A scene point is thus mapped onto two corresponding image points PL and PR in the two stereo
     images (if it is not occluded).
     If the geometry of the camera configuration is known, depth information can be recovered from
     the positions of the corresponding image points.

Problem:

• How to determine parameters of the stereo configuration?


(often simplified setups are used)
• How to find corresponding scene points?
⇒ similar to motion estimation

Simplified Stereo Setup

• Parallel optical axes

• normalized images (stereo baseline parallel to x-axis, no optical distortions)

⇒ relevant parameters:

• width of stereo baseline

• focal length of camera (≈ distance of image plane to optical center)

• (size, distance of image pixels)

⇒ corresponding image points differ in x-coordinate only!


(Note: Additionally, a mapping of camera coordinates to a world coordinate system may be required.)
Slide: Simplified stereo setup (Fig. 52)
Remarks:

• Camera coordinate system (XC , YC , ZC ) refers to left camera

• Coordinates of scene point P (xC , yC , zC ) already in 3D camera coordinates

The following relations between camera (3D) and image coordinates (2D) can be obtained:
(x_L, y_L)ᵀ = (f / z_C) · (x_C, y_C)ᵀ     and     (x_R, y_R)ᵀ = (f / z_C) · (x_C + b, y_C)ᵀ

The distance of corresponding image points is called disparity (obtained from the above relations):

d = x_R − x_L = f · b / z_C

The 3D coordinates of a scene point can be recovered as follows:

(x_C, y_C, z_C)ᵀ = (b / d) · (x_L, y_L, f)ᵀ
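A small sketch of this reconstruction under the simplified setup, assuming the focal length f, baseline b, and the image coordinates are given in consistent units; the numbers in the comment are purely illustrative:

import numpy as np

def reconstruct_point(xL, yL, xR, f, b):
    """Recover camera coordinates (xC, yC, zC) of a scene point from its
    corresponding image points (xL, yL) and (xR, yL) in the simplified setup."""
    d = xR - xL                       # disparity
    if d == 0:
        raise ValueError("zero disparity: point at infinity")
    return (b / d) * np.array([xL, yL, f])

# Illustrative numbers: f = 0.05 m, b = 0.1 m, d = xR - xL = 0.002 m
# -> zC = f * b / d = 2.5 m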

Problem: Correspondence between image points needs to be established!

Methods: Similar to motion estimation: optical flow, block matching

4.4.2 Estimation of Stereo-Correspondence via Optical Flow

Let’s consider a trivial image sequence consisting of left and right image by defining:

f (x, y, 0) = fL (x, y ) and f (x, y, 1) = fR (x, y )

Using the simplified stereo setup (as of above) one obtains the following constraint equation:

Ec = fx u + ft = 0

(with f_t being the partial derivative of f() in the “direction of the stereo setup” and u corresponding
to an estimate of the disparity d).

Note: No displacement/“motion” in y-direction, as images are aligned.


When an ideal solution is not possible or cannot be computed, additional constraints must be used, e.g.
smoothness of the displacement field:

E_s² = u_x² + u_y² → min!

Problem: Assumption of general smoothness is violated especially at boundaries of objects!

Solution: Use so-called “directed” smoothness, which considers changes in the displacement field
perpendicular to the grey-level gradient.

The grey-level gradient is defined as the vector (f_x, f_y)ᵀ.

Perpendicular to the gradient is the vector (f_y, −f_x)ᵀ.

The amount of change of the displacement (u_x, u_y)ᵀ in the direction of some vector is given by the
projection onto that vector:

(f_y, −f_x) · (u_x, u_y)ᵀ = f_y u_x − f_x u_y

Consequently, the following constraint realizes “directed” smoothness:

E_ds² = (f_y u_x − f_x u_y)² → min!

All constraints can be integrated when minimizing the following total error:
E = ∫∫ (E_c² + α² (E_s² + β² E_ds²)) dx dy

Chapter 5

Image Primitives

The basis for recognizing objects in images / describing scenes is usually formed by more elementary
“components” – so-called image primitives.

Goal: Partitioning of images into meaningful primitives that form the basis for the extraction of
objects of interest (for a specific application) within (simple or complex) scenes.

Question: What are “meaningful” image primitives in general?

Examples for scenes / scene complexity:

a) Simple artificial or industrial scenes


(rigid objects before a known or even homogeneous background, controlled lighting con-
ditions)
b) “simple” natural scenes
(mostly indoors [e.g. office environment], environment and lighting conditions controlled
to some extent)
c) Complex natural scenes
(heavily textured, outdoors, unknown lighting conditions, ...)

5.1 Fundamentals

The most widely used (i.e. “established”) image primitives result from image segmentation, where
the goal is to segment a given image into a set of regions or contours.

• Regions ≙ homogeneous image areas (roughly corresponding to objects or object surfaces)

• Contours ≙ local discontinuities / inhomogeneities, i.e. edges (roughly corresponding to boundaries
of objects or object surfaces)

Note: The process of segmenting an image into regions and contours is approximately dual, i.e.
region boundaries are approximately equivalent to contours and contours enclose regions.

A rather “modern” type of image primitives are so-called keypoints (or interest points). A set of key-
points defines a sparse representation (wrt. segmentation) of image content by extracting “interesting”
points/positions within an image, usually augmented by a local feature representation.

5.2 Region Segmentation cf. [Gon02, Sec. 10.4]

Assumptions:

• Regions are topologically connected

• A region is a homogeneous image area ⇒ A predicate P can be defined that is true for one
specific region and false if the region is extended by neighboring pixels (≙ homogeneity criterion)

The following conditions must be satisfied:

1. ∪_i R_i = I, i.e. the set of regions R_i segments (or describes) the whole image I

2. R_i ∩ R_j = ∅ for i ≠ j, i.e. regions don't overlap, they are disjoint sets of pixels

3. R_i is topologically connected (i.e. for every pair of pixels in R_i there exists a connecting path
of neighboring pixels within R_i)

4. P(R_i) = TRUE, i.e. every region satisfies the homogeneity criterion P

5. P(R_i ∪ R_j) = FALSE for i ≠ j and topologically neighboring regions R_i and R_j

Example for a (simple) homogeneity criterion P :


P(R_i) = TRUE,  if |f_jk − f_lm| ≤ d for all pairs of pixels (j, k) and (l, m) in R_i
P(R_i) = FALSE, otherwise

Note: A homogeneity criterion can be defined on any feature of a pixel, e.g. grey level, color, depth,
velocity, ...

5.2.1 Region Growing

Basic Method: Starting from “suitable” initial pixels, regions are iteratively enlarged by adding
neighboring pixels (a minimal code sketch follows the list of problems below).

Problems:

• Finding starting points for region growing process (seed points)


– ideal: exactly one per region
– salient pixels, where saliency needs to be defined by an additional criterion (bright-
est/darkest pixel, high contrast pixel, ...)
– all pixels ⇒ enlarging a region leads to the merging of two regions (Which of the
possible operations is the most suitable one?)
• Ordering of performed region growing operations
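A minimal sketch of region growing from a single seed point, using a simplified version of the homogeneity criterion from above (each candidate pixel is compared to the seed's grey value with tolerance d, rather than checking all pixel pairs); the breadth-first strategy and the parameter names are illustrative:

import numpy as np
from collections import deque

def grow_region(img, seed, d=10):
    """Grow one region from the seed pixel (row, col): a neighboring pixel is
    added as long as its grey value differs from the seed's by at most d."""
    h, w = img.shape
    region = np.zeros((h, w), dtype=bool)
    region[seed] = True
    seed_val = int(img[seed])
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):   # 4-neighborhood
            rr, cc = r + dr, c + dc
            if 0 <= rr < h and 0 <= cc < w and not region[rr, cc] \
                    and abs(int(img[rr, cc]) - seed_val) <= d:
                region[rr, cc] = True
                queue.append((rr, cc))
    return region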

5.2.2 Splitting

Idea: Starting from a single initial region (i.e. the whole image) regions are subsequently subdivided
according to some scheme until all regions satisfy the homogeneity criterion P .

Principle Method:

1) Initialize the set of image regions to a single region R0 = I


2) While not all regions Ri satisfy P (Ri )
2a) Choose some Ri with P (Ri ) = FALSE
2b) Subdivide Ri with a suitable method into regions Rik
2c) Replace Ri in the set of image regions by {Ri1 , Ri2 , ...RiNi }

Problem: Suitable method for subdividing a region!

Possible Solution (a minimal code sketch follows this list):

• Calculate grey-level histogram of the region


• Determine a binarization threshold (e.g. by approximation of the grey-level distribution by
a mixture of two Gaussian distributions or by selection of the most prominent minimum
in the histogram)
• Binarization of the source region Ri
• Connected components/areas form sub-regions Rik of Ri
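A sketch of the subdivision step just outlined. For simplicity the binarization threshold is determined with Otsu's criterion (maximizing the between-class variance) instead of fitting a mixture of two Gaussians or searching for the most prominent histogram minimum; connected components are taken from scipy.ndimage. The image is assumed to be an 8-bit grey-level array, and the region is assumed to actually contain two grey-level populations:

import numpy as np
from scipy.ndimage import label

def otsu_threshold(values):
    """Binarization threshold that maximizes the between-class variance (Otsu);
    a simple stand-in for the Gaussian-mixture / histogram-minimum approaches."""
    hist, _ = np.histogram(values, bins=256, range=(0, 256))
    p = hist / hist.sum()
    omega = np.cumsum(p)                           # probability of the "dark" class
    mu = np.cumsum(p * np.arange(256))             # cumulative mean
    with np.errstate(divide='ignore', invalid='ignore'):
        sigma_b = (mu[-1] * omega - mu) ** 2 / (omega * (1.0 - omega))
    return int(np.nanargmax(sigma_b))

def split_region(img, mask):
    """Subdivide the region given by the boolean 'mask' into connected
    sub-regions obtained from binarizing its grey values."""
    t = otsu_threshold(img[mask])
    subregions = []
    for binary in (mask & (img <= t), mask & (img > t)):
        labels, n = label(binary)                  # connected components
        subregions += [labels == i for i in range(1, n + 1)]
    return subregions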

5.2.3 Split-and-Merge

Motivation: Exploit advantages of region splitting and growing (respectively merging) methods by
combining both techniques

Basic Idea: Starting from an initial segmentation suitable splitting and merging operations are ap-
plied until a final segmentation is reached which satisfies all criteria.

Method (after Pavlidis): Split and merge operations operate on a quad-tree representation of the
image
Slide: Quad-tree representation of image segmentation (Fig. 53)

Split-and-Merge Algorithm (Pavlidis)

1. Calculate initial segmentation


(e.g. randomly, based on prior knowledge [e.g. about the position of an object], or by selecting
an appropriate segmentation level in the quad-tree representation)
Slide: Simple sample image (Fig. 54) and initial segmentation (Fig. 55)

2. Perform all possible region merging operations (within the quad-tree structure!)
(i.e. eventually, 4 nodes in the quad-tree will be replaced by their parent node)
Slide: Result of possible merge operations on the sample image (Fig. 56)

3. Perform all possible region splits, for regions (i.e. quad-tree nodes) that don’t satisfy the
homogeneity criterion P (Ri )
Slide: Result of possible splitting operations on the sample image (Fig. 57)

4. Merge adjacent regions that - when merged - satisfy the homogeneity criterion P (Ri ∪ Rj ).
Note: In this step the algorithm goes beyond the quad-tree structure of the image.
Slide: Result of merging operations outside the quad-tree structure for the sample image (Fig. 58)

5. Eliminate very small regions by merging them with the neighboring region with most similar
grey value.
Slide: Final segmentation result for the sample image (Fig. 59)

5.3 Contour Extraction

Baseline: Edge detection (i.e. contour extraction starts from edge images)

Note: After the application of a suitable thresholding operation, methods for edge detection generate
a set of edge pixels where significant changes in grey level occur (per pixel, a measure of the
strength and possibly the direction of the edge).

Problems:

• Edges are defined on the pixel level only


⇒ Contours need to be built from sets of edge pixels defining an object boundary
• Real contours (e.g. boundaries of objects) are generally several edge pixels wide.
• Image noise gives rise to many non-edge pixels with significant changes in grey level,
i.e. high edge strength.

Descriptions of contours can be

• non-parametric, i.e. linked lists of edge pixels or


• parametric representations, i.e. geometric shapes (lines, curves, [partially defined] cir-
cles/ellipses) derived from underlying edge pixel sets

5.3.1 Contour Following cf. also [For03, ??]

Goals: • Extraction of significant edge elements (ideally exactly one edge pixel in
the direction of the contour),

• grouping of edge pixels, and

• approximation by a parametric function.


Processing Steps:

1. Thinning (of the edge image)


2. Linking (of edge elements)
3. Approximation

Note: Linking & approximation can also be achieved in an integrated manner via the Hough transform.

Thinning

... can e.g. be realized by the following two processing steps:

1. Non-maximum suppression

Goal: Reduction of edges to a width of only 1 pixel.

Basic Method: Eliminate edge elements if the edge intensity is smaller than that of another edge
element in the direction of the gradient (i.e. perpendicular to the edge/potential contour, two
neighbors must be considered)

2. Hysteresis thresholding

Goal: Elimination of non-significant edge elements

Idea: Thresholding with a single threshold usually produces bad results ⇒ use two thresholds!

Basic Method:

• Set two thresholds θlow and θhigh


(suitable choice e.g.: θhigh = 0.1 · maxi,j {gmag (i, j)} and θlow = 0.3 · θhigh )
• If the local edge intensity gmag (i, j) is larger than θhigh : keep edge element.
• If the local edge intensity gmag (i, j) is smaller than θlow : eliminate edge element.
• If the local edge intensity lies between the two thresholds: keep edge element only, if it is
connected to another edge element with gmag (i, j) > θhigh .

Note: Applying the Sobel operator (Gaussian-smoothed gradient calculation) followed by non-maximum
suppression and hysteresis thresholding constitutes the so-called Canny Edge Detector.
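A sketch of hysteresis thresholding on an edge-magnitude image g_mag (non-maximum suppression is assumed to have been applied already). The connectivity check is realized here via connected components over the above-θ_low mask, which is one of several possible implementations; the default thresholds follow the suggested choice from above:

import numpy as np
from scipy.ndimage import label

def hysteresis_threshold(gmag, theta_high=None, theta_low=None):
    """Keep edge elements above theta_high, discard those below theta_low and
    keep in-between elements only if connected to a strong edge element."""
    if theta_high is None:
        theta_high = 0.1 * gmag.max()          # suggested choice from above
    if theta_low is None:
        theta_low = 0.3 * theta_high
    strong = gmag > theta_high
    candidates = gmag > theta_low              # strong and weak edge elements
    # 8-connected components of all candidate edge elements
    labels, n = label(candidates, structure=np.ones((3, 3)))
    keep = np.zeros(n + 1, dtype=bool)
    keep[np.unique(labels[strong])] = True     # components containing a strong element
    keep[0] = False                            # label 0 is the background
    return keep[labels]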
Slide: Example for results of Canny Edge Detector (Fig. 60)

Linking

Goal: Explicit representation of neighborhood relations between edge elements

Idea: Start with a randomly chosen edge element and extend the sequence of linked edge elements by
one neighbor at a time. Repeat until all edge elements are covered by some linked set.

Problems:

• Contours could be broken up into two (partial) chains of edge pixels (if started “in the middle”).
⇒ extend edge pixel chain in both directions
• At junctions multiple continuations for building edge pixel chains are possible.
⇒ Continue contour in direction with most similar gradient.

Approximation

Goal: Approximation of linked sets of edge elements by a parametric function (e.g. line, circle, ...)

Method (for polygonal approximation with line segments, after [Ram72]; a code sketch follows the list):

• Connect starting and ending point of pixel chain by a line.


• Find edge element with maximum distance to this line.
• Split up line at this point, if distance is too large (threshold!).
• Recursively process all segment approximations until no more line splits are necessary.
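A sketch of this recursive splitting scheme (also known as the Ramer–Douglas–Peucker algorithm) for a pixel chain given as an ordered array of (x, y) points; the distance threshold eps is an application-dependent parameter:

import numpy as np

def polygonal_approx(points, eps=2.0):
    """Approximate an ordered chain of edge pixels by line segments: split at the
    point with maximum distance to the chord if that distance exceeds eps."""
    points = np.asarray(points, dtype=float)
    p0, p1 = points[0], points[-1]
    dx, dy = p1 - p0
    norm = np.hypot(dx, dy)
    if norm == 0.0:                             # degenerate chord (closed chain)
        dists = np.hypot(points[:, 0] - p0[0], points[:, 1] - p0[1])
    else:                                       # perpendicular distance to the chord
        dists = np.abs(dx * (points[:, 1] - p0[1]) - dy * (points[:, 0] - p0[0])) / norm
    i = int(np.argmax(dists))
    if dists[i] > eps:
        left = polygonal_approx(points[:i + 1], eps)
        right = polygonal_approx(points[i:], eps)
        return np.vstack([left[:-1], right])    # do not duplicate the split point
    return np.vstack([p0, p1])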

5.3.2 Hough-Transform cf. [Gon02, Sec. 10.2.2]

... does not try to approximate given edge pixels by a parametric representation but investigates the
parameter space of approximations of all edge elements.

a) Simple Case: Approximation by a straight line

Baseline: A straight line can be represented with parameters α and r as follows:

r = x cos α + y sin α

For any point (x_i, y_i) the following then holds:

r = x_i cos α + y_i sin α

This expression can be considered as a function of α. For given (x_i, y_i) it then represents a
sinusoidal curve in the α-r-plane.

Advantage (over the canonical representation of straight lines as y = ax + b):
The space of parameters is finite!

0 ≤ α < π     and     −√2·M ≤ r < √2·M

(for an M × M image; other choices are possible, e.g. the one used in [Gon02, Sec. 10.2.2])

Method:

– For all edge pixels (x_i, y_i) “draw” the associated curve (or function) in the α-r-plane.
– The intersection points of all curves drawn represent parameters (α_k, r_k) of lines which
approximate the associated edge pixels.

Simple example: (Sketch of the α-r plane with 0 ≤ α < π and −√2·M ≤ r < √2·M: the sinusoidal
curves for three points (x_1, y_1), (x_2, y_2), (x_3, y_3) intersect at the parameter pair (α_s, r_s).)

⇒ The line r_s = x cos α_s + y sin α_s passes through the points (x_1, y_1), (x_2, y_2), and (x_3, y_3).
Problem: Determining the intersection points mathematically exactly is in practice too difficult or
unusable (slight deviations occur due to noise, quantization, and errors in the edge detection
process).
Solution: Just as the image plane, the α-r-plane needs to be represented digitally, too. Therefore,
the ranges of α and r are quantized, resulting in an α-r-matrix. The cells of this matrix can be
thought of as accumulators that are incremented whenever a sinusoidal curve for some edge pixel
passes through the accumulator’s parameter range (see the sketch below).
Reconstruction of Contours: Salient contours in the image are found by selecting α-r pa-
rameter pairs with high accumulator counts and checking the associated edge pixels for
continuity.
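A sketch of the accumulator scheme described above for a binary edge image; the quantization of α and r (and thus the accumulator resolution) is an illustrative choice:

import numpy as np

def hough_lines(edges, n_alpha=180, n_r=400):
    """Accumulate votes in the quantized alpha-r plane for every edge pixel of a
    binary edge image; returns the accumulator and the parameter axes."""
    M = max(edges.shape)
    r_max = np.sqrt(2) * M
    alphas = np.linspace(0.0, np.pi, n_alpha, endpoint=False)
    acc = np.zeros((n_alpha, n_r), dtype=int)
    ys, xs = np.nonzero(edges)
    for a_idx, a in enumerate(alphas):
        r = xs * np.cos(a) + ys * np.sin(a)                     # r = x cos(a) + y sin(a)
        r_idx = np.round((r + r_max) / (2 * r_max) * (n_r - 1)).astype(int)
        np.add.at(acc[a_idx], r_idx, 1)                         # increment accumulator cells
    return acc, alphas, np.linspace(-r_max, r_max, n_r)

# Salient lines correspond to cells with high counts, e.g. the strongest one:
# a_idx, r_idx = np.unravel_index(np.argmax(acc), acc.shape)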

Slide: Illustration of the Hough transform (Fig. 61)

Slide: Example of the Hough transform on sample infrared image (Fig. 62)

b) Parametric Case
The Hough transform can be applied in all cases where a parametric representation of the curve
used for approximating contours is possible in the form:

g(x, y, c) = 0

Here (x, y) represents image coordinates and c an arbitrary parameter vector.


E.g. for the approximation of contours with (arcs of) circles:

(x − c_1)² + (y − c_2)² = c_3²

Note: Extension for non-parametric functions possible


⇒ Generalized Hough transform (cf. [Bal82])

5.4 Keypoints

The use of so-called keypoints for image/object description requires ...

• a method for detecting keypoint locations and ...

• a method for describing the local image properties at the keypoint location as uniquely as pos-
sible for later retrieval.

5.4.1 The Scale-Invariant Feature Transform (SIFT) after [Low04]

... keypoint detection method invented (and patented!) by David Lowe.

Goal: Find keypoint locations and appropriate descriptions that are (approximately) invariant with
respect to scale, rotation and — to some extent — view-point changes.

Solutions:

1. Scale invariance → detect keypoints in scale space


2. Rotation invariance → determine local orientation at keypoint location
3. “view-point invariance” → keypoint description based on local gradients

Scale-Space Representation of Images

Different scale representations L(x, y, σ) of an image I(x, y) are obtained via a convolution with a
Gaussian G(x, y, σ) with variable scale (i.e. standard deviation) σ, where

L(x, y, σ ) = I(x, y) ⋆ G(x, y, σ )

and
G(x, y, σ) = (1 / (2πσ²)) · e^{−(x² + y²)/(2σ²)}

The scale space of an image I(x, y) is then defined by the sequence of Gaussian-smoothed versions
L(x, y, kⁿσ) for some σ, a constant multiplicative factor k and n = 0, 1, 2, ...

Note: With every doubling of the scale kⁿσ (i.e. every so-called octave) the scale space image
L(x, y, kⁿσ) can be subsampled by a factor of 2 without losing information.

Keypoints are detected as the local extrema in the Difference of Gaussian (DoG) representation of
scale space, i.e.:

D(x, y, σ) = I(x, y) ⋆ (G(x, y, kσ) − G(x, y, σ)) = L(x, y, kσ) − L(x, y, σ)

⇒ can easily be computed from the difference of neighboring scales L(x, y, kⁿσ).
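A minimal sketch of computing the Gaussian-smoothed images and the DoG images for one octave with SciPy; σ = 1.6, s = 3 intervals per octave, and the s + 3 smoothed images per octave follow the conventions of [Low04], but the code is only an illustration, not Lowe's implementation (pre-smoothing, image doubling, etc. are omitted):

import numpy as np
from scipy.ndimage import gaussian_filter

def dog_octave(img, sigma=1.6, s=3):
    """Gaussian-smoothed images L(x, y, k^n * sigma) for one octave and their
    differences D = L(x, y, k*sigma_n) - L(x, y, sigma_n), with k = 2^(1/s)."""
    k = 2.0 ** (1.0 / s)
    L = [gaussian_filter(img.astype(float), sigma * k ** n) for n in range(s + 3)]
    D = [L[n + 1] - L[n] for n in range(len(L) - 1)]
    return L, D

# The next octave starts from the image smoothed with 2*sigma, subsampled by 2:
# next_octave_input = L[s][::2, ::2]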


Slide: Scale space representation scheme for images (Fig. 63)

Keypoint candidates are defined as the maxima or minima in the DoG images by comparing a pixel
position to its neighbors in 3 × 3 regions in the local and both adjacent scales.
Slide: Determining extrema in DoG image representations (Fig. 64)

Slide: Example of keypoints detected (Fig. 65)

Keypoint Localization and Filtering

Keypoint locations are determined with sub-pixel accuracy by fitting a 3D quadratic function to the
local sample points and determining the location of the interpolated extremum.
Furthermore, keypoint candidates are discarded that:

• have low contrast or

• lie on edges.

Keypoint Orientation

After interpolation a keypoint can be associated with the Gaussian-smoothed image L(x, y, σ) at the
scale σ closest to the keypoint.
The local orientation at the keypoint (x0 , y0 ) is then determined from a histogram of gradient orien-
tations (resolution of 10 degrees, i.e. 36 bins, weighted by gradient magnitude), which is computed
in the region around the keypoint over the gradients of L(x, y, σ) (weighted by a circular Gaussian
window with σ ′ = 1.5σ ).
⇒ Local maxima (peaks) in the histogram correspond to dominant local orientations.
Keypoints are created for all local maxima in the gradient histogram which are within 80% of the
global maximum (i.e. one candidate location might be used to create multiple keypoints).

After keypoint creation peak positions are interpolated by a parabola over three adjacent bins in the
orientation histogram.

Keypoint Descriptor

• (Image) Gradient magnitudes & orientation are sampled around the keypoint (at scale of key-
point, weighted by Gaussian)

• Orientations are rotated relative to keypoint orientation (≈ rotation invariance)

• Orientation histograms are created over m × m sample regions (e.g. 4 × 4 regions with 8 × 8
sample array)

⇒ Concatenation of histograms defines the keypoint descriptor (usually 4 × 32 = 128 bytes)

Slide: Scheme for computing SIFT keypoint descriptor (Fig. 66)

Note: Keypoint descriptors can be matched (nearest neighbor) across images in order to identify –
to some extent – identical/similar locations at different scales/rotations/view points.

Slide: Correspondences between matched keypoints (Fig. 67)

Chapter 6

Appearance-Based Object Recognition

In contrast to object recognition based on image primitives (regions, contours) and some method for
finding appropriately structured configurations of those that correspond to objects, in appearance-
based approaches objects are represented – more or less – by using image data directly (e.g. different
views of an object).
Known instances of objects can then be identified by matching the image-based representations to
new data. Using an appropriate similarity criterion, objects from an object category can also be found.

6.1 Template Matching cf. also [Gon02, pp. 205-208]

... simplest case of an appearance based model.

• Build “model” of an object by extracting a (rectangular) image area (the template f ) showing
the desired object.
Note: Also multiple templates (≙ views) per object can be used.

• For finding (this object, a similar one) in a new image g compute “similarity” between the
template and the new image at every position in the image.
Common similarity measure: cross-correlation
h(x, y) = f(x, y) ◦ g(x, y) = (1 / (M·N)) Σ_{m=0}^{M−1} Σ_{n=0}^{N−1} f(m, n) · g(x + m, y + n)

Note: Similar to convolution!

• Position(s) where object is hypothesized correspond to (local) maxima of the cross-correlation


measure h.

Note: As the cross-correlation measure not only depends on the similarity between template and
image but also on the local brightness of the image, an appropriate normalization is necessary
in practice (see the tutorials).
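A brute-force sketch of such a normalized similarity measure (normalized cross-correlation): template and image patch are made zero-mean and scaled by their norms, so local brightness and contrast differences are compensated; function and variable names are illustrative:

import numpy as np

def normalized_cross_correlation(template, image):
    """Normalized cross-correlation of template f with image g at every valid
    position; values lie in [-1, 1], local maxima indicate object hypotheses."""
    f = template.astype(float)
    g = image.astype(float)
    M, N = f.shape
    f0 = f - f.mean()
    f_norm = np.sqrt((f0 ** 2).sum())
    H, W = g.shape[0] - M + 1, g.shape[1] - N + 1
    h = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            patch = g[y:y + M, x:x + N]
            p0 = patch - patch.mean()
            denom = f_norm * np.sqrt((p0 ** 2).sum())
            h[y, x] = (f0 * p0).sum() / denom if denom > 0 else 0.0
    return h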

6.2 Matching Configurations of Keypoints cf. [Low04]

Note: Matching of single keypoints not reliable enough for object detection / recognition

Basic Idea: Match sets of keypoints (lying on desired object) and verify correct spatial configuration
of matches

Simple Method: When assuming no distortion of the keypoint configuration (i.e. no deformation of the
object due to e.g. out-of-plane rotations), use the generalized Hough transform (a rough sketch
follows the numbered steps below)

1. Model of object:
(a) Set of keypoints (specified by their descriptors [including local orientation and scale!])
(b) Reference point on object
(c) For each keypoint: Vector to the reference point relative to keypoint orientation and scale
2. Matching process:
(a) For every matching keypoint determine reference point candidate (exploit local scale
& orientation of matched keypoint)
(b) Vote for all reference point candidates associated with keypoint matches
(c) Reference point(s) with highest number(s) of votes determines location of object hy-
pothesis/es
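A rough sketch of this matching-and-voting scheme, under the assumption that every model keypoint stores a descriptor and an offset vector to the reference point (expressed in its own scale/orientation frame), and every scene keypoint provides descriptor, position, scale, and orientation. The dictionary layout, the descriptor-distance threshold, and the cell size of the voting grid are all illustrative:

import numpy as np

def vote_reference_points(model_kps, scene_kps, img_shape, cell=16, max_dist=0.7):
    """For every scene keypoint find the nearest model descriptor and, if the match
    is good enough, cast a vote for the implied reference-point position."""
    acc = np.zeros((img_shape[0] // cell + 1, img_shape[1] // cell + 1))
    model_desc = np.stack([m['desc'] for m in model_kps])
    for kp in scene_kps:
        dist = np.linalg.norm(model_desc - kp['desc'], axis=1)
        i = int(np.argmin(dist))
        if dist[i] > max_dist:                         # no sufficiently similar descriptor
            continue
        m = model_kps[i]
        s = kp['scale'] / m['scale']                   # relative scale
        a = kp['orientation'] - m['orientation']       # relative orientation
        R = np.array([[np.cos(a), -np.sin(a)],
                      [np.sin(a),  np.cos(a)]])
        ref = np.asarray(kp['pos']) + s * (R @ np.asarray(m['offset']))
        r, c = int(ref[0]) // cell, int(ref[1]) // cell
        if 0 <= r < acc.shape[0] and 0 <= c < acc.shape[1]:
            acc[r, c] += 1                             # vote for this reference-point cell
    return acc     # cells with the highest counts yield object location hypotheses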

6.3 Eigenimages [Tur91], cf. also [For03, Sec. 22.3.2]

Idea: Relevant information for representing a class/set of known objects should be automatically
derived from sample data.

Basic Abstraction: Sample images are considered as points in a high-dimensional vector space (of
images).

Goal: Find suitable representation of the “point cloud” in the image space that represents the known
objects.
Derive appropriate similarity measure for finding known object instances or objects from the
modeled category.

Most well known example: Eigenfaces, i.e. application to the problem of face detection/identification

6.3.1 Formal Problem Statement

• Given a set of points/vectors (samples) in some high-dimensional vector space

• Determine sub-space that is defined by the sample set

• Representation of sub-space possible via center of gravity (i.e. mean vector) of samples and
sample covariance matrix
⇒ Principal components of the covariance matrix (i.e. its eigenvectors) span the sub-space

• Every sample (i.e. vector in the sub-space) can be reconstructed via a linear combination of the
eigenvectors

• Other images (vectors) can be approximated.

Principle Method
Building the Model:

1) Collect a set of sample images from the class of objects to be modeled (e.g. face images)
Slide: Example of sample set of face images (Fig. 68)

2a) Compute the principal components (Eigenimages) of the sample set and
Slide: Example of Eigenfaces obtained (Fig. 69)

2b) Select the K eigenvectors corresponding to the largest eigenvalues for representing the data

3) For all known instances (individuals) compute the projection onto the modeled sub-space (which
is spanned by the selected eigenvectors).
⇒ one K-dimensional vector of weights per instance
Note: Modeling quality can be assessed by inspecting reconstructions of the known data.

Recognition using the Model:

4) For an unknown image (e.g. a new face image) compute its distance to the modeled sub-space
(e.g. face-space).
⇒ Reject (i.e. none of the known objects) if too large.
Note: When searching a larger image for smaller realizations of the known objects many pos-
sible sub-images at different scales need to be considered!
Slide: Example of face/non-face classification using image reconstruction via Eigenfaces (Fig. 70)

5a) Compute projection onto the sub-space (see above) and

5b) Classify the resulting vector as known or unknown instance given the projections of the (known)
sample data (e.g. by mapping to the nearest neighbor in the projection space).

6.3.2 Computation of Eigenfaces

... or in general: Eigenimages (for some known object category)

• Let I(x, y) be an M × M grey-level image of a face
⇒ it can be considered as a vector of length M²

• Let Γ = {Γ1 , Γ2 , ...ΓN } be the set of training (face) images

• The average face image Ψ is given by:

Ψ = (1/N) Σ_{n=1}^{N} Γ_n

The deviation of a face image from the average is, therefore: Φn = Γn − Ψ (i.e. the set
Φ = {Φ1 , Φ2 , ...ΦN } corresponds to face images normalized to have zero mean).

• For the set of difference images Φ perform a Principal Component Analysis (PCA)
⇒ set of K < N (maximum value of K = N − 1 if all Φ_n are linearly independent) orthonormal
vectors u_1, u_2, ..., u_K, i.e.

u_kᵀ u_l = 1 if k = l, and 0 otherwise

(where {·}ᵀ denotes vector/matrix transpose) and

λ_k = (1/N) Σ_{n=1}^{N} (u_kᵀ Φ_n)²     (the sum of the squared projections of the Φ_n onto u_k)

is maximum.

• The vectors u_k and scalars λ_k are eigenvectors and eigenvalues, respectively, of the covariance
matrix C of the normalized images:

C = (1/N) Σ_{n=1}^{N} Φ_n Φ_nᵀ = A Aᵀ     with     A = [Φ_1, Φ_2, ..., Φ_N]

Problem: The covariance matrix C has dimension M² × M² (i.e. for e.g. 512 × 512 images
512² × 512² = 262144 × 262144)
⇒ computation of eigenvalues/vectors extremely problematic
Solution: As there are only N images in the training set that is used for computing C, a
maximum of K = N − 1 eigenvectors of C exist (N − 1 ≪ M²).
Compute eigenvectors of AᵀA (dimension: N × N)

AᵀA v_i = µ_i v_i     | multiply by A from the left
A Aᵀ (A v_i) = A µ_i v_i = µ_i (A v_i)

⇒ A v_i is an eigenvector of A Aᵀ


Calculation of u_k:
– Construct the matrix L = AᵀA with L_ij = Φ_iᵀ Φ_j
– Compute eigenvectors/values of L
– Eigenvectors of C are obtained as:

u_k = A v_k = [Φ_1, Φ_2, ..., Φ_N] v_k = Σ_{n=1}^{N} v_{kn} Φ_n

• A known face image Γ can now be represented by its projection onto face space:

ω_k = u_kᵀ (Γ − Ψ),     k = 1, ..., K

with Ωᵀ = [ω_1, ω_2, ..., ω_K]

• Any image can be reconstructed (in general only approximately) from its projection onto face
space
With Φ = Γ − Ψ the reconstruction of Φ in face space is:
Φ̂ = Σ_{k=1}^{K} ω_k u_k

which gives an overall reconstructed image Γ̂ = Φ̂ + Ψ

• Non-face images can be rejected based on their distance to face space (i.e. the quality of the
reconstruction):
dface = ||Φ − Φ̂||2

• Face images of known (or possibly unknown) individuals can be identified based on the similarity
of their projections to those of known individuals from the training set.
Simple method: choose the individual whose projection Ω_j has minimum Euclidean distance to the
projection Ω of the query image:

j = argmin_k ||Ω − Ω_k||²

Problem: How to reject face images of unknown individuals?
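A compact sketch of the computations above, using the small N × N matrix L = AᵀA and NumPy's symmetric eigendecomposition; Gamma is assumed to be an N × M² array whose rows are the vectorized training images, and all names are illustrative:

import numpy as np

def train_eigenfaces(Gamma, K):
    """Gamma: N x M^2 array, one vectorized training face per row.
    Returns the mean face Psi, K eigenfaces (rows of U), and the projections Omega."""
    Psi = Gamma.mean(axis=0)                      # average face
    Phi = Gamma - Psi                             # zero-mean faces (rows are Phi_n)
    L = Phi @ Phi.T                               # N x N matrix A^T A instead of M^2 x M^2
    eigvals, V = np.linalg.eigh(L)                # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:K]
    U = V[:, order].T @ Phi                       # u_k = A v_k = sum_n v_kn Phi_n
    U /= np.linalg.norm(U, axis=1, keepdims=True) # normalize the eigenfaces
    Omega = Phi @ U.T                             # K-dim projections of the training faces
    return Psi, U, Omega

def classify_face(gamma, Psi, U, Omega):
    """Project a query image onto face space, compute its distance to face space
    and the index of the nearest known projection."""
    phi = gamma - Psi
    omega = U @ phi                               # projection onto face space
    phi_hat = U.T @ omega                         # reconstruction in face space
    d_face = np.linalg.norm(phi - phi_hat)        # distance to face space (rejection test)
    j = int(np.argmin(np.linalg.norm(Omega - omega, axis=1)))
    return j, d_face

Rejection of non-faces (and, with an additional threshold on the projection distance, of unknown individuals) can then be based on d_face and on the distance to the nearest known projection.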

Slide: Processing steps necessary for implementation of the Eigenface approach (Fig. 71)

Chapter 7

Tracking
after [For03, Chap. 17]

7.1 Introduction

When objects move in a scene, a sequence of images is in general required in order to draw inferences
about the motion of the objects. The situation is similar when the camera (i.e. the observer) moves
through a scene; then the motion of the observer can be inferred. This problem is known as tracking.

Application areas are, e.g.:

• Targeting (in the military domain): try to predict an object’s (target’s) future position in the
attempt to shoot it.
Note: Usually radar or infrared images are used.

• Surveillance: motion patterns of e.g. people on a parking lot are used to draw inferences about
their goals (e.g. trying to steal a car)

• Automotive: traffic assistance systems (e.g. for lane-keeping, adaptive cruise control) infer
motion of lane marks or other vehicles

• Motion capture: special effects in movies sometimes rely on the possibility to track a moving
person accurately and to later map the motion onto another - usually artificially generated -
actor

In order to address the tracking problem the following is required:

• A model of the object’s motion (i.e. its internal state) and

• some set of measurements from the image sequence (e.g. the object’s position estimate)

We will consider drawing inferences in the linear case only, i.e. the motion model and the measure-
ment are linear.

7.2 Tracking as an Inference Problem

... more specifically a probabilistic inference problem.


The object is described by its internal state Xt (at time t or in the t-th frame of the image sequence,
respectively). Measurements are taken according to a random variable Yt (actual value: yt ).
Three main problems to be solved:

• Prediction: from past measurements y_0, y_1, ..., y_{t−1} predict the internal state at the t-th frame:

P(X_t | Y_0 = y_0, ..., Y_{t−1} = y_{t−1})

• Data association: from multiple measurements (e.g. position estimates) at frame t “select” the
“correct” one. Can be achieved based on the prediction of Xt .
Possible methods:

– Selecting the nearest neighbor, i.e. from several measurements ytk choose the one max-
imising P (ytk |y0 , ...yt−1 ).
– Perform gating (i.e. exclude measurements that are too different from the prediction) and
probabilistic data association (i.e. a weighted sum [according to the prediction probabil-
ity] of the gated measurements):
y_t = Σ_k P(y_t^k | y_0, ..., y_{t−1}) · y_t^k

• Correction: correct the internal state incorporating the new measurement, i.e. compute:

P (Xt |y0 , ...yt−1 , Yt = yt )

Necessary independence assumptions (in order to make things tractable):

• Only the immediate past matters, i.e.:

P (Xt |X0 , X1 , ...Xt−1 ) = P (Xt |Xt−1 )

• Measurements depend only on the current state, i.e. not on other measurements taken:

P (Yt , Yj , ...Yk |Xt ) = P (Yt |Xt )P (Yj , ...Yk |Xt )

⇒ Inferences for tracking have the structure of a Hidden Markov Model!

7.3 Kalman-Filter

We consider the following linear dynamic model of motion:

• All probability distributions are Gaussians, i.e. can be represented by their mean and associated
covariance matrix.

• An estimate x̂ of the predicted state is obtained by a linear transform D (could be dependent


on time but usually is not):
x̂t = Dxt−1

which is the mean of P (Xt |Xt−1 ).

• The uncertainty of the prediction is described by the covariance matrix Σd (could be time
dependent).

• A measurement matrix M (could be dependent on time but usually is not) is used to convert
between internal state and measurements taken:

ŷt = M xt

• The uncertainty about the measurement process is represented by the covariance matrix Σm
(which could also be time dependent).

Note: The state vector Xt is normally distributed with mean x̂t and covariance matrix Σd . The
measurement vector Yt is normally distributed with mean ŷt and covariance Σm .

Examples of State Representations

... for different assumptions on the nature of the dynamic model.


Note: Measurements are usually 2D/3D position estimates p.

• (Quasi-)Stationary point: Internal state and measurements represent identical quantities, i.e.
M = I. Motion occurs only under random component, i.e. uncertainty of the measurement
(when this is assumed to be quite large, the model can be used for tracking if nothing is known
about the object’s dynamics).

• Constant velocity: The position can be predicted as

pt = pt−1 + ∆t · v

Velocity is added to the state representation, i.e. x = {pT , v T }T .

The dynamic model is then given by

D = [ I    ∆t·I
      0    I    ]

and the measurement matrix is

M = [ I    0 ]

• Constant acceleration: analog to the above with additional acceleration parameter a as com-
ponent of the state vector.

Kalman-Filtering Algorithm

Goal: Estimate Gaussian probability distributions describing the linear dynamic model optimally in
the sense of least mean squared error.

Processing Steps:
Distinguish between state representation estimates before (e.g. x̂⁻_t) and after (e.g. x̂⁺_t) the
incorporation of a new measurement y_t (a minimal code sketch follows the processing steps below).


0. Assume some initial estimates of x̂⁻_0 and the covariance Σ_0 are known.

1. Predict the new internal state x_t from the past state applying the dynamic model of motion:

x̂⁻_t = D x̂⁺_{t−1}

Σ⁻_t = D Σ⁺_{t−1} Dᵀ + Σ_d

(the covariance combines the predicted uncertainty and the uncertainty of the prediction process)


2. Correct the prediction taking into account the current measurement y_t (Note: Data association
needs to be solved separately!).
Compute the Kalman gain

K_t = Σ⁻_t Mᵀ (M Σ⁻_t Mᵀ + Σ_m)⁻¹

which represents the ratio between the uncertainty of the model (Σ⁻_t) and the uncertainty
of the measurement process (M Σ⁻_t Mᵀ + Σ_m).
The innovation

y_t − M x̂⁻_t

represents the difference between the estimated and the measured position.
Depending on the Kalman gain, the innovation is used to correct the estimated state:

x̂⁺_t = x̂⁻_t + K_t (y_t − M x̂⁻_t)

Σ⁺_t = (I − K_t M) Σ⁻_t
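A minimal sketch of one prediction/correction cycle for the constant-velocity model from above (state x = (pᵀ, vᵀ)ᵀ with 2D position measurements); the noise covariances are illustrative placeholders:

import numpy as np

def kalman_step(x, P, y, dt=1.0, sigma_d=1e-2, sigma_m=1e-1):
    """One prediction + correction step for a 2D constant-velocity model.
    x: state (px, py, vx, vy), P: its covariance, y: measured 2D position."""
    I2 = np.eye(2)
    D = np.block([[I2, dt * I2],
                  [np.zeros((2, 2)), I2]])          # dynamic model (constant velocity)
    M = np.hstack([I2, np.zeros((2, 2))])           # measurement matrix: position only
    Sigma_d = sigma_d * np.eye(4)                   # uncertainty of the prediction process
    Sigma_m = sigma_m * np.eye(2)                   # uncertainty of the measurement process
    # 1. Prediction
    x_pred = D @ x
    P_pred = D @ P @ D.T + Sigma_d
    # 2. Correction
    K = P_pred @ M.T @ np.linalg.inv(M @ P_pred @ M.T + Sigma_m)   # Kalman gain
    x_new = x_pred + K @ (y - M @ x_pred)           # innovation weighted by the gain
    P_new = (np.eye(4) - K @ M) @ P_pred
    return x_new, P_new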

Bibliography

[Bal82] D. H. Ballard, C. M. Brown: Computer Vision, Prentice-Hall, Englewood Cliffs, New Jersey, 1982.

[For03] D. A. Forsyth, J. Ponce: Computer Vision: A Modern Approach, Prentice-Hall, Upper Saddle River, NJ, 2003.

[Gon02] R. C. Gonzalez, R. E. Woods: Digital Image Processing, Prentice-Hall, Upper Saddle River, NJ, 2nd ed., 2002.

[Hor81] B. K. P. Horn, B. G. Schunck: Determining Optical Flow, Artificial Intelligence, Vol. 17, 1981, pp. 185–203.

[Jäh02] B. Jähne: Digital Image Processing, Springer, Berlin, 5th ed., 2002.

[Low04] D. Lowe: Distinctive Image Features from Scale-Invariant Keypoints, Int. J. of Computer Vision, Vol. 60, No. 2, 2004, pp. 91–110.

[Nie90] H. Niemann: Pattern Analysis and Understanding, Vol. 4 of Series in Information Sciences, Springer, Berlin Heidelberg, 2nd ed., 1990.

[Nie03] H. Niemann: Klassifikation von Mustern, 2003.

[Ram72] U. Ramer: An Iterative Procedure for the Polygonal Approximation of Plane Curves, Computer Graphics and Image Processing, Vol. 1, No. 3, 1972, pp. 244–256.

[Sch01] D. Schlüter: Hierarchisches Perzeptives Gruppieren mit Integration dualer Bildbeschreibungen, Dissertation, Universität Bielefeld, Technische Fakultät, Oct. 2001.

[Tur91] M. Turk, A. Pentland: Eigenfaces for Recognition, Journal of Cognitive Neuroscience, Vol. 3, No. 1, 1991, pp. 71–86.
Figure 1: Aerial image of a well-known university
Figure 2: Traffic in the city of Taipei
Figure 3: Infrared image of North America (Source: http://www.nnic.noaa.gov/SOCC/gallery.htm)
Figure 4: Multispectral LandSat image of the Amazonas rain-forest region (Source: http://www.nnic.noaa.gov/SOCC/gallery.htm)
Figure 5: X-ray image of human chest (left) and hand (right)
Figure 6: X-ray image of the “Lockman Hole”
Figure 7: Ultrasonic image of a human embryo
Figure 8: Depth image of an in-door scene
Figure 9: “Stanley” (left) and “Highlander” (right) with sensors (Source: DARPA)
Figure 10: Surveillance of persons arriving at the parking lot before IBM’s T. J. Watson research center
Figure 11: Example of face detection results (by Yann LeCun)
Figure 12: Example of face detection results for strange people (by Yann LeCun)
Figure 13: What do all these objects – except one – have in common? [Jäh02, p. 17]
Figure 14: Gestalt laws (after [Sch01])
Figure 15: Example of illusions in intensity perception
Figure 16: “Raw” digital image in numeric representation
Figure 17: “Raw” digital image in numeric and grey-scale representation
Figure 18: Overview of the electromagnetic spectrum; range of visible light enlarged
Figure 19: Paraxial refraction: A light ray through P1 is refracted at P (where it intersects the interface, i.e. the surface of the lens) and then intersects the optical axis at P2. The geometric center of the interface is C, its radius R; all angles are assumed small (after [For03, Fig. 1.8, p. 9])
Figure 20: A thin lens: Rays through O are not refracted, rays parallel to the optical axis are focused in F′. Also note the different in-focus image points for object points at different distances (cf. [For03, Fig. 1.9, p. 10])
Figure 21: The human eye as an imaging system (from [Gon02, Chap. 2])
Figure 22: Schematic structure of the human eye (from [Gon02, Chap. 2])
Figure 23: Distribution of rods and cones on the retina (from [Gon02, Chap. 2])
Figure 24: Response of cones to different wavelengths (from [Bal82, Chap. 2])
Figure 25: Chromaticity diagram (cf. [Gon02, Chap. 6])
Figure 26: HSI color space (from [Gon02, Chap. 6])
Figure 27: Additive color mixing in the RGB model vs. subtractive color mixing in the CMY model
Figure 28: Coordinate convention used with digital images (after [Gon02, Chap. 2]; note: there the coordinate axes are swapped!)
Figure 29: Different definitions of pixel neighborhoods (left). Which “objects” are connected? (right) (after [Jäh02, p. 42])
Figure 30: Structure of a CCD device (after [For03, p. 16])
Figure 31: Basic image types: dark, light, low contrast, high contrast with corresponding intensity histograms (from [Gon02, Fig. 3.15, Chap. 3])
Figure 32: Examples of histogram equalization on images from Fig. 31 with corresponding final intensity histograms (from [Gon02, Fig. 3.17, Chap. 3])
Figure 33: Examples of position/orientation normalization based on normalization of image (grey level) moments (from [Nie03, Fig. 2.5.7, Chap. 2])
Figure 34: Example of normalization of character slant by applying a shear transform (from [Nie03, Fig. 2.5.8, Chap. 2])
Figure 35: Principle of filtering using masks (from [Gon02, Chap. 3]; note: coordinate axes are swapped w.r.t. the usual upper-left convention!)
Figure 36: Examples of 1D functions f(x) and the corresponding power spectra |F(u)|, i.e. the magnitude of the 1D-DFT (from [Gon02, Chap. 4])
Figure 37: Example of a 512 × 512 image containing a 20 × 40 white rectangle and the associated centered logarithmic power spectrum log(1 + |F(u, v)|) (from [Gon02, Chap. 4])
Figure 38: Example of erosion (combined with dilation)
Figure 39: Example of dilation (combined with erosion)
Figure 40: Example for smoothing by averaging vs. smoothing via the median (panels: original, 5×5 average, 5×5 median)
Figure 41: Example for erosion (= minimum), dilation (= maximum) and edge filtering (“morphological edge”) (panels: original, 5×5 morphological edge, 5×5 erosion/min, 5×5 dilation/max)
Figure 42: Example for opening (erosion + dilation) and closing (dilation + erosion) (panels: original, 5×5 opening, 5×5 closing, 5×5 erosion & dilation)
Figure 44: Examples of behaviour of derivatives at an ideal ramp edge (from [Gon02, Chap. 10])
Figure 45: Example of a ramp edge corrupted by Gaussian noise with increasing variance, showing the effect in the 1st and 2nd derivatives (from [Gon02, Chap. 10])
Figure 46: Example of gradient in x- and y-direction and combined magnitude (from [Gon02, Chap. 10])
Figure 47: 5 × 5 Laplacian of a Gaussian (LoG) mask (from [Gon02, Chap. 10])
Figure 48: Example of Sobel operator, Laplacian and Laplacian smoothed with a Gaussian (LoG)
Figure 49: Example of some natural textures (from [Jäh02])
Figure 50: (upper left) Image of a periodic texture, (upper right) its power spectrum S(u, v), (middle left) S(r), and (middle right) S(θ); (lower left) a different periodic texture and (lower right) corresponding S(θ) (from [Gon02, Chap. 11, Fig. 11.24])
Figure 51: Example of optical flow computation: (a) first and (b) second image of the sequence, (c) estimated flow field, (d) detail of (c), (e) color-coded flow field, and (f) color code map for vector representation (From: http://www.cs.ucf.edu/˜jxiao/opticalflow.htm)
Figure 52: Simplified stereo setup (after [Nie90])
Figure 53: Quad-tree representation of image segmentation (after [Nie90, p. 108], cf. also [Gon02, p. 616])
Figure 54: Split-and-Merge Algorithm: Simple sample image in numeric and grey-level representation
Figure 55: Split-and-Merge Algorithm: Initial segmentation of the sample image obtained by selecting the 3rd level of the quad-tree representation
Figure 56: Split-and-Merge Algorithm: Result of possible merge operations on the sample image
Figure 57: Split-and-Merge Algorithm: Result of possible splitting operations on the sample image
Figure 58: Split-and-Merge Algorithm: Result of merging operations outside the quad-tree structure for the sample image
Figure 59: Split-and-Merge Algorithm: Final segmentation result for the sample image
Figure 60: Example of results obtained with the Canny Edge Detector
Figure 61: Illustration of the Hough transform (from [Gon02, Chap. 10, Fig. 10.20])
Figure 62: Example of the Hough transform on a sample infrared image (from [Gon02, Chap. 10, Fig. 10.21])
Figure 63: Scale space representation scheme for images: Gaussian-smoothed image at different scales, organized into octaves (left), and Difference of Gaussian representation (right) (from [Low04])
Figure 64: Determining extrema in DoG image representations (from [Low04])
Figure 65: Example of keypoint detection on a natural image: (a) original image, (b) initial 832 keypoint locations at maxima and minima of the DoG representation (from [Low04])
Figure 66: Scheme for computing the SIFT keypoint descriptor (from [Low04])
Figure 67: Correspondences between matched keypoints within two images of a well-known building taken from different view points
Figure 68: Example of a sample set of face images (from http://www.pages.drexel.edu/˜sis26/Eigenface%20Tutorial.htm)
Figure 69: Example of Eigenfaces obtained (from http://whitechapel.media.mit.edu/vismod/demos/facerec/index.html)
Figure 70: Example of face/non-face classification using image reconstruction via Eigenfaces (from http://www.cs.princeton.edu/˜cdecoro/eigenfaces/)
Figure 71: Processing steps necessary for the implementation of the Eigenface approach (after [Tur91])