Statistical Digital Signal Processing and Modeling
MONSON H. HAYES
Georgia Institute of Technology
It is also dedicated to Michael, Kimberly, and Nicole, for the joy that
they have brought into my life, and to my parents for all of their love and
support of me in all of my endeavors.
CONTENTS
Preface xi
1 INTRODUCTION 1
2 BACKGROUND 7
2.1 Introduction 7
2.2 Discrete-Time Signal Processing 7
2.2.1 Discrete-Time Signals 8
2.2.2 Discrete-Time Systems 9
2.2.3 Time-Domain Descriptions of LSI Filters 12
2.2.4 The Discrete-Time Fourier Transform 12
2.2.5 The z-Transform 14
2.2.6 Special Classes of Filters 16
2.2.7 Filter Flowgraphs 18
2.2.8 The DFT and FFT 18
2.3 Linear Algebra 20
2.3.1 Vectors 21
2.3.2 Linear Independence, Vector Spaces, and Basis Vectors 24
2.3.3 Matrices 25
2.3.4 Matrix Inverse 27
2.3.5 The Determinant and the Trace 29
2.3.6 Linear Equations 30
2.3.7 Special Matrix Forms 35
2.3.8 Quadratic and Hermitian Forms 39
2.3.9 Eigenvalues and Eigenvectors 40
2.3.10 Optimization Theory 48
2.4 Summary 52
2.5 Problems 52
Index 599
PREFACE
This book is the culmination of a project that began as a set of notes for a graduate
level course that is offered at Georgia Tech. In writing this book, there have been many
challenges. One of these was the selection of an appropriate title for the book. Although the
title that was selected is Statistical Digital Signal Processing and Modeling, any one of a number
of other titles could equally well have been chosen. For example, if the title of a book is
to capture its central theme, then the title perhaps could have been Least Squares Theory
in Signal Processing. If, on the other hand, the title should reflect the role of the book
within the context of a course curriculum, then the title should have been A Second Course
in Discrete-Time Signal Processing. Whatever the title, the goal of this book remains the
same: to provide a comprehensive treatment of signal processing algorithms for modeling
discrete-time signals, designing optimum digital filters, estimating the power spectrum of
a random process, and designing and implementing adaptive filters.
In looking through the Table of Contents, the reader may wonder what the reasons were
in choosing the collection of topics in this book. There are two. The first is that each topic
that has been selected is not only important, in its own right, but is also important in a
wide variety of applications such as speech and audio signal processing, image processing,
array processing, and digital communications. The second is that, as the reader will soon
discover, there is a remarkable relationship that exists between these topics that tie together
a number of seemingly unrelated problems and applications. For example, in Chapter 4 we
consider the problem of modeling a signal as the unit sample response of an all-pole filter.
Then, in Chapter 7, we find that all-pole signal modeling is equivalent to the problem of
designing an optimum (Wiener) filter for linear prediction. Since both problems require
finding the solution to a set of Toeplitz linear equations, the Levinson recursion that is
derived in Chapter 5 may be used to solve both problems, and the properties that are shown
to apply to one problem may be applied to the other. Later, in Chapter 8, we find that
an all-pole model performs a maximum entropy extrapolation of a partial autocorrelation
sequence and leads, therefore, to the maximum entropy method of spectrum estimation.
This book possesses some unique features that set it apart from other treatments of
statistical signal processing and modeling. First, each chapter contains numerous examples
that illustrate the algorithms and techniques presented in the text. These examples play
an important role in the learning process. However, of equal or greater importance is the
working of new problems by the student. Therefore, at the end of each chapter, the reader
will find numerous problems that range in difficulty from relatively simple exercises to more
involved problems. Since many of the problems introduce extensions, generalizations, or
applications of the material in each chapter, the reader is encouraged to read through the
problems that are not worked. In addition to working problems, another important step
in the learning process is to experiment with signal processing algorithms on a computer
using either real or synthetic data. Therefore, throughout the book the reader will find
MATLAB programs that have been written for most of the algorithms and techniques that
are presented in the book and, at the end of each chapter, the reader will find a variety of
computer exercises that use these programs. In these exercises, the student will study the
performance of signal processing algorithms, look at ways to generalize the algorithms or
make them more efficient, and write new programs.
Another feature that is somewhat unique to this book concerns the treatment of complex
signals. Since a choice had to be made about how to deal with complex signals, the easy thing
to have done would have been to consider only real-valued signals, leaving the generalization
to complex signals to the exercises. Another possibility would have been to derive all
results assuming that signals are real-valued, and to state how the results would change for
complex-valued signals. Instead, however, the approach that was taken was to assume, from
the beginning, that all signals are complex-valued. This not only saves time and space, and
avoids having to jump back and forth between real-valued and complex-valued signals, but
it also allows the reader or instructor to easily treat real-valued signals as a special case.
This book consists of nine chapters and an appendix and is suitable for a one-quarter or
one-semester course at the Senior or Graduate level. It is intended to be used in a course that
follows an introductory course in discrete-time signal processing. Although the prerequisites
for this book are modest, each is important. First, it is assumed that the reader is familiar
with the fundamentals of discrete-time signal processing including difference equations,
discrete convolution and linear filtering, the discrete-time Fourier transform, and the z-
transform. This material is reviewed in Chapter 2 and is covered in most standard textbooks
on discrete-time signal processing [3,6]. The second prerequisite is a familiarity with linear
algebra. Although an extensive background in linear algebra is not required, it will be
necessary for the reader to be able to perform a variety of matrix operations such as matrix
multiplication, finding the inverse of a matrix, and evaluating its determinant. It will also
be necessary for the reader to be able to find the solution to a set of linear equations and to
be familiar with eigenvalues and eigenvectors. This is standard material that may be found
in any one of a number of excellent textbooks on linear algebra [2,5], and is reviewed in
Chapter 2. Also in Chapter 2 is a section of particular importance that may not typically
be in a student's background. This is the material covered in Section 2.3.10, which is
concerned with optimization theory. The specific problem of interest is the minimization
of a quadratic function of one or more complex variables. Although the minimization of a
quadratic function of one or more real variables is fairly straightforward and only requires
setting the derivative of the function with respect to each variable equal to zero, there are
some subtleties that arise when the variables are complex. For example, although we know
that the quadratic function f(z) = |z|² has a unique minimum that occurs at z = 0, it is
not clear how to formally demonstrate this since this function is not differentiable. The last
prerequisite is a course in basic probability theory. Specifically, it is necessary for the reader
to be able to compute ensemble averages such as the mean and variance, to be familiar with
jointly distributed random variables, and to know the meaning of terms such as statistical
independence, orthogonality, and correlation. This material is reviewed in Chapter 3 and
may be found in many textbooks [1,4].
This book is structured to allow a fair amount of flexibility in the order in which the
topics may be taught. The first three chapters stand by themselves and, depending upon
the background of the student, may either be treated as optional reading, reviewed quickly,
or used as a reference. Chapter 2, for example, reviews the basic principles of discrete-
time signal processing and introduces the principles and techniques from linear algebra
that will be used in the book. Chapter 3, on the other hand, reviews basic principles of
probability theory and provides an introduction to discrete-time random processes. Since it
is not assumed that the reader has had a course in random processes, this chapter develops
all of the necessary tools and techniques that are necessary for our treatment of stochastic
signal modeling, optimum linear filters, spectrum estimation, and adaptive filtering.
In Chapter 4, we begin our treatment of statistical signal processing and modeling with
the development of a number of techniques for modeling a signal as the output of a linear
shift-invariant filter. Most of this chapter is concerned with models for deterministic signals,
which include the methods of Pade, Prony, and Shanks, along with the autocorrelation and
covariance methods. Due to its similarity with the problem of signal modeling, we also look
at the design of a least-squares inverse filter. Finally, in Section 4.7, we look at models for
discrete-time random processes, and briefly explore the important application of spectrum
estimation. Depending on the interest of the reader or the structure of a course, this section
may be postponed or omitted without any loss in continuity.
The initial motivation for the material in Chapter 5 is to derive efficient algorithms for
the solution to a set of Toeplitz linear equations. However, this chapter accomplishes much
more than this. Beginning with the derivation of a special case of the Levinson recursion
known as the Levinson-Durbin recursion, this chapter then proceeds to establish a number of
very important properties and results that may be derived from this recursion. These include
the introduction of the lattice filter structure, the proof of the stability of the all-pole model
that is formed using the autocorrelation method, the derivation of the Schur-Cohn stability
test for digital filters, a discussion of the autocorrelation extension problem, the Cholesky
decomposition of a Toeplitz matrix, and a procedure for recursively computing the inverse
of a Toeplitz matrix that may be used to derive the general Levinson recursion and establish
an interesting relationship that exists between the spectrum estimates produced using the
minimum variance method and the maximum entropy method. The chapter concludes with
the derivation of the Levinson and the split Levinson recursions.
The focus of Chapter 6 is on lattice filters and on how they may be used for signal
modeling. The chapter begins with the derivation of the FIR lattice filter, and then proceeds
to develop other lattice filter structures, which include the all-pole and allpass lattice filters,
lattice filters that have both poles and zeros, and the split lattice filter. Then, we look at
lattice methods for all-pole signal modeling. These methods include the forward covariance
method, the backward covariance method, and the Burg algorithm. Finally, the chapter
concludes by looking at how lattice filters may be used in modeling discrete-time random
processes.
In Chapter 7 we turn our attention to the design of optimum linear filters for estimating
one process from another. The chapter begins with the design of an FIR Wiener filter and
explores how this filter may be used in such problems as smoothing, linear prediction, and
noise cancellation. We then look at the problem of designing IIR Wiener filters. Although a
noncausal Wiener filter is easy to design, when a causality constraint is imposed on the filter
structure, the design requires a spectral factorization. One of the limitations of the Wiener
filter, however, is the underlying assumption that the processes that are being filtered are
wide-sense stationary. As a result, the Wiener filter is linear and shift-invariant. The chapter
concludes with an introduction to the discrete Kalman filter which, unlike the Wiener filter,
may be used for stationary as well as nonstationary processes.
Chapter 8 considers the problem of estimating the power spectrum of a discrete-time
random process. Beginning with the classical approaches to spectrum estimation, which in-
volve taking the discrete-time Fourier transform of an estimated autocorrelation sequence,
we will examine the performance of these methods and will find that they are limited
in resolution when the data records are short. Therefore, we then look at some modern
approaches to spectrum estimation, which include the minimum variance method, the max-
imum entropy method, and parametric methods that are based on developing a model for a
random process and using this model to estimate the power spectrum. Finally, we look at
eigenvector methods for estimating the frequencies of a harmonic process. These methods
include the Pisarenko harmonic decomposition, MUSIC, the eigenvector method, the min-
imum norm algorithm, and methods that are based on a principal components analysis of
the autocorrelation matrix.
Finally, Chapter 9 provides an introduction to the design, implementation, and analysis
of adaptive filters. The focus of this chapter is on the LMS algorithm and its variations,
and the recursive least squares algorithm. There are many applications in which adaptive
filters have played an important role such as linear prediction, echo cancellation, channel
equalization, interference cancellation, adaptive notch filtering, adaptive control, system
identification, and array processing. Therefore, included in this chapter are examples of
some of these applications.
Following Chapter 9, the reader will find an Appendix that contains some documentation
on how to use the MATLAB m-files that are found in the book. As noted in the appendix,
these m-files are available by anonymous ftp from
ftp.eedsp.gatech.edu/pub/users/mhayes/stat_dsp
and may be accessed from the Web server for this book at
http://www.ece.gatech.edu/users/mhayes/stat_dsp
The reader may also wish to browse this Web site for additional information such as new
problems and m-files, notes and errata, and reader comments and feedback.
A typical one quarter course for students who have not been exposed to discrete-time
random processes might cover, in depth, Sections 3.3-3.7 of Chapter 3; Chapter 4; Sec-
tions 5.1-5.3 of Chapter 5; Sections 6.1, 6.2, and 6.4-6.7 of Chapter 6; Chapter 7; and
Chapter 8. For students that have had a formal course in random processes, Chapter 3 may
be quickly reviewed or omitted and replaced with Sections 9.1 and 9.2 of Chapter 9. Al-
ternatively, the instructor may wish to introduce adaptive approaches to signal modeling
at the end of Chapter 4, introduce the adaptive Wiener filter after Chapter 7, and discuss
techniques for adaptive spectrum estimation in Chapter 8. For a semester course, on the
other hand, Chapter 9 could be covered in its entirety. Starred (*) sections contain material
that is a bit more advanced, and these sections may be easily omitted without any loss of
continuity.
In teaching from this book, I have found it best to review linear algebra only as it is
needed, and to begin the course by spending a couple of lectures on some of the more
advanced material from Sections 3.3-3.7 of Chapter 3, such as ergodicity and the mean
ergodic theorems, the relationship between the power spectrum and the maximum and
minimum eigenvalues of an autocorrelation matrix, spectral factorization, and the Yule-
Walker equations. I have also found it best to restrict attention to the case of real-valued
signals, since treating the complex case requires concepts that may be less familiar to the
student. This typically only requires replacing derivatives that are taken with respect to z*
with derivatives that are taken with respect to z.
Few textbooks are ever written in isolation, and this book is no exception. It has been
shaped by many discussions and interactions over the years with many people. I have been
fortunate in being able to benefit from the collective experience of my colleagues at Georgia
Tech: Tom Barnwell, Mark Clements, Jim McClellan, Vijay Madisetti, Francois Malassenet,
Petros Maragos, Russ Mersereau, Ron Schafer, Mark Smith, and Doug Williams. Among
these people, I would like to give special thanks to Russ Mersereau, my friend and colleague,
who has followed the development of this book from its beginning, and who will probably
never believe that this project is finally done. Many of the ways in which the topics in this
book have been developed and presented are attributed to Jim McClellan, who first exposed
me to this material during my graduate work at M.I.T. His way of looking at the problem of
signal modeling has had a significant influence in much of the material in this book. I would
also like to acknowledge Jae Lim and Alan Oppenheim, whose leadership and guidance
played a significant and important role in the development of my professional career. Other
people I have been fortunate to be able to interact with over the years include my doctoral
students Mitch Wilkes, Erlandur Karlsson, Wooshik Kim, David Mazel, Greg Vines, Ayhan
Sakarya, Armin Kittel, Sam Liu, Baldine-Brunel Paul, Haluk Aydinoglu, Antai Peng, and
Qin Jiang. In some way, each of these people has influenced the final form of this book.
I would also like to acknowledge all of the feedback and comments given to me by all
of the students who have studied this material at Georgia Tech using early drafts of this
book. Special thanks goes to Ali Adibi, Abdelnaser Adas, David Anderson, Mohamed-Slim
Alouini, Osama Al-Sheikh, Rahmi Hezar, Steven Kogan, and Jeff Schodorf for the extra
effort that they took in proof-reading and suggesting ways to improve the presentation of
the material.
I would like to express thanks and appreciation to my dear friends Dave and Annie,
Pete and Linda, and Olivier and Isabelle for all of the wonderful times that we have shared
together while taking a break from our work to relax and enjoy each other's company.
Finally, we thank the following reviewers for their suggestions and encouragement
throughout the development of this text: Tom Alexander, North Carolina State University;
Jan P. Allenbach, Purdue University; Andreas Antoniou, University of Victoria; Ernest G.
Baxa, Clemson University; Takis Kasparis, University of Central Florida; JoAnn B. Koskol,
Widener University; and Charles W. Therrien, Naval Postgraduate School.
References

1 INTRODUCTION
Our ability to communicate is key to our society. Communication involves the exchange of
information, and this exchange may be over short distances, as when two people engage in
a face-to-face conversation, or it may occur over large distances through a telephone line or
satellite link. The entity that carries this information from one point to another is a signal.
A signal may assume a variety of different forms and may carry many different types of
information. For example, an acoustic wave generated by the vocal tract of a human carries
speech, whereas electromagnetic waves carry audio and video information from a radio
or television transmitter to a receiver. A signal may be a function of a single continuous
variable, such as time, it may be a function of two or more continuous variables, such as
(x, y, t) where x and yare spatial coordinates, or it may be a function of one or more
discrete variables.
Signal processing is concerned with the representation, manipulation, and transforma-
tion of signals and the information that they carry. For example, we may wish to enhance a
signal by reducing the noise or some other interference, or we may wish to process the signal
to extract some information such as the words contained in a speech signal, the identity of
a person in a photograph, or the classification of a target in a radar signal.
Digital signal processing (DSP) is concerned with the processing of information that
is represented in digital form. Although DSP, as we know it today, began to bloom in the
1960s, some of the important and powerful processing techniques that are in use today may
be traced back to numerical algorithms that were proposed and studied centuries ago. In
fact, one of the key figures whose work plays an important role in laying the groundwork
for much of the material in this book, and whose name is attached to a number of terms
and techniques, is the mathematician Karl Friedrich Gauss. Although Gauss is usually
associated with terms such as the Gaussian density function, Gaussian elimination, and the
Gauss-Seidel method, he is perhaps best known for his work in least squares estimation.¹
It should be interesting for the reader to keep a note of some of the other personalities
appearing in this book and the dates in which they made their contributions. We will find,
for example, that Prony's work on modeling a signal as a sum of exponentials was published
in 1795, the work of Pade on signal matching was published in 1892, and Schuster's work
on the periodogram appeared in 1898. Two of the more recent pioneers whose names we
will encounter and that have become well known in the engineering community are those
of N. Wiener (1949) and R. E. Kalman (1960).

¹Gauss is less well known for his work on fast algorithms for computing the Fourier series coefficients of a
sampled signal. Specifically, he has recently been credited with the derivation of the radix-2 decimation-in-time
Fast Fourier Transform (FFT) algorithm [6].
Since the early 1970s when the first DSP chips were introduced, the field of Digital
Signal Processing (DSP) has evolved dramatically. With a tremendously rapid increase in
the speed of DSP processors along with a corresponding increase in their sophistication and
computational power, digital signal processing has become an integral part of many products
and applications. Coupled with the development of efficient algorithms for performing
complex signal processing tasks, digital signal processing is radically changing the way
that signal processing is done and is becoming a commonplace term.
The purpose of this book is to introduce and explore the relationships between four very
important signal processing problems: signal modeling, optimum filtering, spectrum esti-
mation, and adaptive filtering. Although these problems may initially seem to be unrelated,
as we progress through this book we will find a number of different problems reappearing
in different forms and we will often be using solutions to previous problems to solve new
ones. For example, we will find that one way to derive a high-resolution spectrum estimate
is to use the tools and techniques derived for modeling a signal and then use the model to
estimate the spectrum.
The prerequisites necessary for this book are modest, consisting of an introduction to
the basic principles of digital signal processing, an introduction to basic probability theory,
and a general familiarity with linear algebra. For purposes of review as well as to introduce
the notation that will be used throughout this book, Chapter 2 provides an overview of the
fundamentals of DSP and discusses the concepts from linear algebra that will be useful
in our representation and manipulation of signals. The following paragraphs describe, in
general terms, the topics that are covered in this book and overview the importance of these
topics in applications.
An introductory course in digital signal processing is concerned with the analysis and design
of systems for processing deterministic discrete-time signals. A deterministic signal may be
defined as one that can be described by a mathematical expression or that can be reproduced
repeatedly. Simple deterministic signals include the unit sample, a complex exponential,
and the response of a digital filter to a given input. In almost any application, however, it
becomes necessary to consider a more general type of signal known as a random process.
Unlike a deterministic signal, a random process is an ensemble or collection of signals
that is defined in terms of the statistical properties that the ensemble possesses. Although a
sinusoid is a deterministic signal, it may also be used as the basis for the random process
consisting of the ensemble of all possible sinusoids having a given amplitude and frequency.
The randomness or uncertainty in this process is contained in the phase of the sinusoid. A
more common random process that we will be frequently concerned with is noise. Noise is
pervasive and occurs in many different places and in many forms. For example, noise may
be quantization errors that occur in an A/D converter, it may be round-off noise injected
into the output of a fixed point digital filter, or it may be an unwanted disturbance such
as engine noise in the cockpit of an airplane or the random disturbances picked up by a
sonar array on the ocean floor. As the name implies, a random process is
SIGNAL MODELING
The efficient representation of signals is at the heart of many signal processing problems
and applications. For example, with the explosion of information that is transmitted and
stored in digital form, there is an increasing need to compress these signals so that they
may be more efficiently transmitted or stored. Although this book does not address the
problem of signal compression directly, it is concerned with the representation of discrete-
time signals and with ways of processing these signals to enhance them or to extract some
information from them. We will find that once it is possible to accurately model a signal, it
then becomes possible to perform important signal processing tasks such as extrapolation
and interpolation, and we will be able to use the model to classify signals or to extract
certain features or characteristics from them.
One approach that may be used to compress or code a discrete-time signal is to find
a model that is able to provide an accurate representation for the signal. For example, we
may consider modeling a signal as a sum of sinusoids. The model or code would consist of
the amplitudes, frequencies, and phases of the sinusoidal components. There is, of course,
a plethora of different models that may be used to represent discrete-time signals, ranging
from simple harmonic decompositions to fractal representations [2]. The approach that is
used depends upon a number of different factors including the type of signal that is to
be compressed and the level of fidelity that is required in the decoded or uncompressed
signal. In this book, our focus will be on how we may most accurately model a signal as
the unit sample response of a linear shift-invariant filter. What we will discover is that there
are many different ways in which to formulate such a signal modeling problem, and that
each formulation leads to a solution having different properties. We will also find that the
techniques used to solve the signal modeling problem, as well as the solutions themselves,
will be useful in our finding solutions to other problems. For example, the FIR Wiener
filtering problem may be solved almost by inspection from the solutions that we derive for
all-pole signal modeling, and many of the approaches to the problem of spectrum estimation
are based on signal modeling techniques.
In spite of the ever-increasing power and speed of digital signal processors, there will
always be a need to develop fast algorithms for performing a specific task or executing a
particular algorithm. Many of the problems that we will be solving in this book will require
that we find the solution to a set of linear equations. Many different approaches have been
developed for solving these equations, such as Gaussian elimination and, in some cases,
efficient algorithms exist for finding the solution. As we will see in our development of
different approaches to signal modeling and as we will see in our discussions of Wiener
filtering, linear equations having a Toeplitz form will appear often. Due to the tremendous
amount of structure in these Toeplitz linear equations, we will find that the number of
computations required to solve these equations may be reduced from order n³ required
for Gaussian elimination to order n². The key in this reduction is the Levinson and the more
specialized Levinson-Durbin recursions. Interesting in their own right, what is perhaps even
more important are the properties and relationships that are hidden within these recursions.
We will see, for example, that embedded in the Levinson-Durbin recursion is a filter structure
known as the lattice filter that has many important properties that make it attractive in
signal processing applications. We will also find that the Levinson-Durbin recursion forms
the basis for a remarkably simple stability test for digital filters.
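To make the computational savings concrete, the short MATLAB sketch below (not from the text; it assumes the Signal Processing Toolbox functions xcorr and levinson are available, and uses placeholder data) solves the same set of Toeplitz normal equations both directly and with the Levinson-Durbin recursion.

    % Solving Toeplitz normal equations directly vs. with the
    % Levinson-Durbin recursion (xcorr and levinson are from the
    % Signal Processing Toolbox).
    x = randn(1000, 1);                    % placeholder data
    p = 10;                                % all-pole model order
    r = xcorr(x, p, 'biased');             % autocorrelation, lags -p..p
    r = r(p+1:end);                        % keep lags 0..p

    R = toeplitz(r(1:p));                  % direct solve: O(p^3)
    a_direct = [1; -(R \ r(2:p+1))];

    a_lev = levinson(r, p).';              % Levinson-Durbin: O(p^2)

    max(abs(a_direct - a_lev))             % agree to numerical precision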
LATTICE FILTERS
As mentioned in the previous paragraph, the lattice filter is a structure that emerges from
the Levinson-Durbin recursion. Although most students who have taken a first course in
digital signal processing are introduced to a number of different filter structures such as
the direct form, cascade, and parallel structures, the lattice filter is typically not introduced.
Nevertheless, there are a number of advantages that a lattice filter enjoys over these other
filter structures that often make it a popular structure to use in signal processing applications.
The first is the modularity of the filter. It is this modularity and the stage-by-stage optimality
of the lattice filter for linear prediction and all-pole signal modeling that allows for the order
of the lattice filter to be easily increased or decreased. Another advantage of these filters
is that it is trivial to determine whether or not the filter is minimum phase (all of the
roots inside the unit circle). Specifically, all that is required is to check that all of the filter
coefficients (reflection coefficients) are less than 1 in magnitude. Finally, compared to other
filter structures, the lattice filter tends to be less sensitive to parameter quantization effects.
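As a quick illustration of this stability test (a sketch, not an example from the book; it assumes the Signal Processing Toolbox function poly2rc), the reflection coefficients of an arbitrary polynomial may be computed and checked directly:

    % Minimum phase check via reflection coefficients.
    a = [1 -0.9 0.2];                      % arbitrary example: A(z) = 1 - 0.9z^-1 + 0.2z^-2
    k = poly2rc(a);                        % reflection coefficients
    all(abs(k) < 1)                        % true: all roots are inside the unit circle
    all(abs(roots(a)) < 1)                 % cross-check using the roots directly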
There is always a desire or the need to design the optimum filter-the one that will perform
a given task or function better than any other filter. For example, in a first course in DSP
one learns how to design a linear phase FIR filter that is optimum in the Chebyshev sense of
minimizing the maximum error between the frequency response of the filter and the response
of the ideal filter [7]. The Wiener and Kalman filters are also optimum filters. Unlike a typical
linear shift-invariant filter, a Wiener filter is designed to process a given signal x(n), the
input, and form the best estimate in the mean square sense of a related signal d(n), called
the desired signal. Since x(n) and d(n) are not known in advance and may be described
only in a statistical sense, it is not possible to simply use H(e^{jω}) = D(e^{jω})/X(e^{jω}) as
the frequency response of the filter. The Kalman filter may be used similarly to recursively
find the best estimate. As we will see, these filters are very general and may be used to
solve a variety of different problems such as prediction, interpolation, deconvolution, and
smoothing.
SPECTRUM ESTIMATION
As mentioned earlier in this chapter, the frequency domain provides a different window
through which one may view a discrete-time signal or random process. The power spectrum
is the Fourier transform of the autocorrelation sequence of a stationary process. In a number
of applications it is necessary that the power spectrum of a process be known. For example,
the IIR Wiener filter is defined in terms of the power spectral densities of two processes, the
input to the filter and the desired output. Without prior knowledge of these power spectral
densities it becomes necessary to estimate them from observations of the processes. The
power spectrum also plays an important role in the detection and classification of periodic
or narrowband processes buried in noise.
In developing techniques for estimating the power spectrum of a random process, we will
find that the simple approach of Fourier transforming the data from a sample realization does
not provide a statistically reliable or high-resolution estimate of the underlying spectrum.
However, if we are able to find a model for the process, then this model may be used to
estimate the spectrum. Thus, we will find that many of the techniques and results developed
for signal modeling will prove useful in understanding and solving the spectrum estimation
problem.
ADAPTIVE FILTERS
The final topic considered in this book is adaptive filtering. Throughout most of the discus-
sions of signal modeling, Wiener filtering, and spectrum estimation, it is assumed that the
signals that are being processed or analyzed are stationary; that is, their statistical properties
are not varying in time. In the real world, however, this will never be the case. Therefore,
these problems are reconsidered within the context of nonstationary processes. Beginning
with a general FIR Wiener filter, it is shown how a gradient descent algorithm may be used
to solve the Wiener-Hopf equations and design the Wiener filter. Although this algorithm
is well behaved in terms of its convergence properties, it is not generally used in practice
since it requires knowledge of the process statistics, which are generally unknown. Another
approach, which has been used successfully in many applications, is the stochastic gradient
algorithm known as LMS. Using a simple gradient estimate in place of the true gradient
in a gradient descent algorithm, LMS is efficient and well understood in terms of its con-
vergence properties. A variation of the LMS algorithm is the perceptron algorithm used
6 INTRODualON
in pattern recognition and is the starting point for the design of a neural network. Finally,
while the LMS algorithm is designed to solve the Wiener filtering problem by minimizing
a mean square error, a deterministic least squares approach leads to the development of the
RLS algorithm. Although computationally much more involved than the stochastic gradient
algorithms, RLS enjoys a significant performance advantage.
There are many excellent textbooks that deal in much more detail and rigor with the
subject of adaptive filtering and adaptive signal processing [1, 3,4, 5, 8]. Here, our goal
is to simply provide the bridge to that literature, illustrating how the problems that we
have solved in earlier chapters may be adapted to work in nonstationary environments. It is
here, too, that we are able to look at some important applications such as linear prediction,
channel equalization, interference cancelation, and system identification.
CLOSING
The problems that are considered in the following chapters are fundamental and important.
The similarities and relationships between these problems are striking and remarkable.
Many of the problem solving techniques presented in this text are powerful and general
and may be successfully applied to a variety of other problems in digital signal processing
as well as in other disciplines. It is hoped that the reader will share in the excitement of
unfolding these relationships and embarking on a journey into the world of statistical signal
processing.
References

2 BACKGROUND

2.1 INTRODUCTION
There are two goals of this chapter. The first is to provide the reader with a summary as
well as a reference for the fundamentals of discrete-time signal processing and the basic
vector and matrix operations of linear algebra. The second is to introduce the notation and
terminology that will be used throughout the book. Although it would be possible to move
directly to Chapter 3 and only refer to this chapter as a reference as needed, the reader should,
at a minimum, peruse the topics presented in this chapter in order to become familiar with
what is expected in terms of background and to become acquainted with the terminology
and notational conventions that will be used.
The organization of this chapter is as follows. The first part contains a review of the
fundamentals of discrete-time signal processing. This review includes a summary of impor-
tant topics in discrete-time signal manipulation and representation such as linear filtering,
difference equations, digital networks, discrete-time Fourier transforms, z-transforms, and
the DFT. Since this review is not intended to be a comprehensive treatment of discrete-
time signal processing, the reader wishing a more in-depth study may consult any one of a
number of excellent texts including [7, 8, 11]. The second part of this chapter summarizes
the basic ideas and techniques from linear algebra that will be used in later chapters. The
topics that are covered include vector and matrix manipulations, the solution of linear equa-
tions, eigenvalues and eigenvectors, and the minimization of quadratic forms. For a more
detailed treatment of linear algebra, the reader may consult any one of a number of standard
textbooks such as [6, 10].
The unit sample, denoted by δ(n), is defined as

    δ(n) = { 1,  n = 0
           { 0,  otherwise
and plays the same role in discrete-time signal processing that the unit impulse plays in
continuous-time signal processing. The unit sample may be used to decompose an arbitrary
signal x(n) into a sum of weighted (scaled) and shifted unit samples as follows:

    x(n) = Σ_{k=-∞}^{∞} x(k) δ(n - k)
This decomposition is the discrete version of the sifting property for continuous-time signals.
The unit step, denoted by u(n), is defined by

    u(n) = { 1,  n ≥ 0
           { 0,  otherwise

and is related to the unit sample by

    u(n) = Σ_{k=-∞}^{n} δ(k)
¹Discrete-time signals may be either deterministic or random (stochastic). Here it is assumed that the signals are
deterministic. In Chapter 3 we will consider the characterization of discrete-time random processes.
[Figure: an example of a discrete-time signal x(n) plotted as a function of the index n.]

Another signal of interest is the complex exponential, x(n) = e^{jnω₀}, where ω₀ is some
real constant. Complex exponentials are useful in the Fourier decomposition of signals as
described in Section 2.2.4.
Discrete-time signals may be conveniently classified in terms of their duration or extent.
For example, a discrete-time sequence is said to be a finite length sequence if it is equal to
zero for all values of n outside a finite interval [N₁, N₂]. Signals that are not finite in length,
such as the unit step and the complex exponential, are said to be infinite length sequences.
Since the unit step is equal to zero for n < 0, it is said to be a right-sided sequence. In
general, a right-sided sequence is any infinite-length sequence that is equal to zero for all
values of n < n₀ for some integer n₀. Similarly, an infinite-length sequence x(n) is said to
be left sided if, for some integer n₀, x(n) = 0 for all n > n₀. An example of a left-sided
sequence is

    x(n) = u(n₀ - n) = { 1,  n ≤ n₀
                       { 0,  n > n₀
which is a time-reversed and delayed unit step. An infinite-length signal that is neither
right sided nor left sided, such as the complex exponential, is referred to as a two-sided
sequence.
or

    y(n) = 0.5 y(n - 1) + x(n)
It is also possible, however, to describe a system in terms of an algorithm that provides a
sequence of instructions or operations that is to be applied to the input signal values. For
[Figure: a discrete-time system T[·] with input x(n) and output y(n) = T[x(n)].]
example,

    y₁(n) = 0.5 y₁(n - 1) + 0.25 x(n)
    y₂(n) = 0.25 y₂(n - 1) + 0.5 x(n)
    y₃(n) = 0.4 y₃(n - 1) + 0.5 x(n)
    y(n)  = y₁(n) + y₂(n) + y₃(n)
is a sequence of instructions that defines a third-order recursive digital filter in parallel form.
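A minimal MATLAB sketch of this parallel form (illustrative only, not code from the book) simply runs the three first-order sections with filter and sums their outputs:

    % Parallel-form implementation of the third-order recursive filter above.
    x  = [1, zeros(1, 49)];                % unit sample input
    y1 = filter(0.25, [1 -0.5],  x);       % y1(n) = 0.5  y1(n-1) + 0.25 x(n)
    y2 = filter(0.5,  [1 -0.25], x);       % y2(n) = 0.25 y2(n-1) + 0.5  x(n)
    y3 = filter(0.5,  [1 -0.4],  x);       % y3(n) = 0.4  y3(n-1) + 0.5  x(n)
    y  = y1 + y2 + y3;                     % overall output y(n)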
In some cases, a system may conveniently be specified in terms of a table that defines the
set of all possible input/output signal pairs of interest. In the Texas Instruments Speak and
Spell™, for example, pushing a specific button (the input signal) results in the synthesis of
a given spoken letter or word (the system output).
Discrete-time systems are generally classified in terms of the properties that they possess.
The most common properties include linearity, shift-invariance, causality, stability, and
invertibility, which are described in the following sections.
Linearity and Shift-Invariance. The two system properties that are of the greatest
importance for simplifying the analysis and design of discrete-time systems are linearity
and shift-invariance. A system T[·] is said to be linear if, for any two inputs x₁(n) and
x₂(n) and for any two (complex-valued) constants a and b,

    T[a x₁(n) + b x₂(n)] = a T[x₁(n)] + b T[x₂(n)]
In other words, the response of a linear system to a sum of two signals is the sum of the
two responses, and scaling the input by a constant results in the output being scaled by the
same constant. The importance of this property is evidenced in the observation that if the
input is decomposed into a superposition of weighted and shifted unit samples,

    x(n) = Σ_{k=-∞}^{∞} x(k) δ(n - k)

then the output is

    y(n) = T[x(n)] = Σ_{k=-∞}^{∞} x(k) h_k(n)                                  (2.1)

where h_k(n) = T[δ(n - k)] is the response of the system to the delayed unit sample δ(n - k).
Equation (2.1) is referred to as the superposition sum and it illustrates that a linear system
is completely specified once the signals h_k(n) are known.
The second important property is shift-invariance.² A system is said to be shift-invariant
if a shift in the input by n₀ results in a shift in the output by n₀. Thus, if y(n) is the response
²Some authors use the term time-invariance instead of shift-invariance. However, since the independent variable,
n, may represent something other than time, such as distance, we prefer to use the term shift-invariance.
of a shift-invariant system to an input x(n), then for any shift in the input, x(n - n₀), the
response of the system will be y(n - n₀). In effect, shift-invariance means that the properties
of the system do not change with time.
A system that is both linear and shift-invariant is called a linear shift-invariant (LSI)
system. For a shift-invariant system, if h(n) = T[δ(n)] is the response to a unit sample δ(n),
then the response to δ(n - k) is h(n - k). Therefore, for an LSI system the superposition
sum given in Eq. (2.1) becomes

    y(n) = Σ_{k=-∞}^{∞} x(k) h(n - k)

which is the convolution sum. In order to simplify notation, the convolution sum is often
expressed as

    y(n) = x(n) * h(n)
Stability. In many applications, it is important for a system to have a response, y(n), that
is bounded in amplitude whenever the input is bounded. A system with this property is
said to be stable in the Bounded-Input Bounded-Output (BIBO) sense. More specifically, a
system is BIBO stable if, for any bounded input, |x(n)| ≤ A < ∞, the output is bounded,
|y(n)| ≤ B < ∞. In the case of a linear shift-invariant system, stability is guaranteed
whenever the unit sample response is absolutely summable,

    Σ_{n=-∞}^{∞} |h(n)| < ∞

For example, an LSI system with h(n) = aⁿu(n) is stable whenever |a| < 1.
An LSI system may also be described by a linear constant coefficient difference equation
(LCCDE),

    y(n) + Σ_{k=1}^{p} a(k) y(n - k) = Σ_{k=0}^{q} b(k) x(n - k)

In this difference equation, p and q are integers that determine the order of the system
and a(1), ..., a(p) and b(0), ..., b(q) are the filter coefficients that define the system. The
difference equation is often written in the form

    y(n) = Σ_{k=0}^{q} b(k) x(n - k) - Σ_{k=1}^{p} a(k) y(n - k)               (2.5)
which clearly shows that the output y(n) is equal to a linear combination of past output
values, y(n - k) for k = 1, 2, ..., p, along with past and present input values, x(n - k) for
k = 0, 1, ..., q. For the special case of p = 0, the difference equation becomes

    y(n) = Σ_{k=0}^{q} b(k) x(n - k)                                           (2.6)

and the output is simply a weighted sum of the current and past input values. As a result,
the unit sample response is finite in length,

    h(n) = Σ_{k=0}^{q} b(k) δ(n - k)
and the system is referred to as a Finite length Impulse Response (FIR) system. However,
if p ≠ 0 then the unit sample response is, in general, infinite in length and the system is
referred to as an Infinite length Impulse Response (IIR) system. For example, if

    y(n) = a y(n - 1) + x(n)

then the unit sample response is h(n) = aⁿu(n).
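The distinction is easy to see numerically. The following MATLAB sketch (illustrative values only, not code from the book) generates the unit sample response of an FIR filter and of the first-order IIR filter above:

    % Unit sample responses of an FIR and a first-order IIR filter.
    delta = [1, zeros(1, 29)];             % unit sample, n = 0, ..., 29

    b = [1 2 3 2 1];                       % FIR: h(n) is finite in length
    h_fir = filter(b, 1, delta);

    a = 0.8;                               % IIR: y(n) = a y(n-1) + x(n)
    h_iir = filter(1, [1 -a], delta);

    max(abs(h_iir - a.^(0:29)))            % ~ 0, since h(n) = a^n u(n)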
In order for the DTFT of a signal to be defined, however, the sum in Eq. (2.7) must converge.
A sufficient condition for the sum to converge uniformly to a continuous function of ω is
that x(n) be absolutely summable,

    Σ_{n=-∞}^{∞} |x(n)| < ∞                                                    (2.8)
Although most signals of interest will have a DTFT, signals such as the unit step and the
complex exponential are not absolutely summable and do not have a DTFT. However, if we
allow the DTFT to contain generalized functions, then the DTFT of a complex exponential
is an impulse,

    X(e^{jω}) = 2π u₀(ω - ω₀),      |ω| < π

where u₀(ω - ω₀) is used to denote an impulse at frequency ω = ω₀. Similarly, the DTFT
of a unit step is

    U(e^{jω}) = 1/(1 - e^{-jω}) + π u₀(ω),      |ω| < π
The DTFT possesses some symmetry properties of interest. For example, since e^{-jnω}
is periodic in ω with a period of 2π, it follows that X(e^{jω}) is also periodic with a period of
2π. In addition, if x(n) is real-valued then X(e^{jω}) will be conjugate symmetric,

    X(e^{jω}) = X*(e^{-jω})
This DTFT is called the frequency response of the filter, and it defines how a complex
exponential is changed in amplitude and phase by the system. Note that the condition for
the existence of the DTFT given in Eq. (2.8) is the same as the condition for BIBO stability of
an LSI system. Therefore, it follows that the DTFT of h(n) exists for BIBO stable systems.
The DTFT is an invertible transformation in the sense that, given the DTFT X(e^{jω}) of
a signal x(n), the signal may be recovered using the Inverse DTFT (IDTFT) as follows:

    x(n) = (1/2π) ∫_{-π}^{π} X(e^{jω}) e^{jnω} dω                              (2.10)
In addition to providing a method for computing x(n) from X(e^{jω}), the IDTFT may also
be viewed as a decomposition of x(n) into a linear combination of complex exponentials.
There are a number of useful and important properties of the DTFT. Perhaps the most
important of these is the convolution theorem, which states that the DTFT of a convolution
of two signals,

    y(n) = x(n) * h(n)

is equal to the product of the transforms,

    Y(e^{jω}) = X(e^{jω}) H(e^{jω})
Another useful property is Parseval's theorem, which states that the sum of the squares of
a signal, x(n), is equal to the integral of the square of its DTFT,

    Σ_{n=-∞}^{∞} |x(n)|² = (1/2π) ∫_{-π}^{π} |X(e^{jω})|² dω                   (2.11)
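Parseval's theorem is easily checked numerically; for a finite-length sequence, the integral in Eq. (2.11) may be evaluated exactly by sampling the DTFT with a sufficiently long DFT. A brief MATLAB sketch (illustrative only):

    % Numerical check of Parseval's theorem using the DFT to sample the DTFT.
    x = randn(1, 64);                      % arbitrary finite-length sequence
    N = 256;                               % number of frequency samples, N >= length(x)
    X = fft(x, N);                         % X(e^{jw}) at w = 2*pi*k/N

    sum(abs(x).^2)                         % time-domain energy
    mean(abs(X).^2)                        % (1/N) sum |X(k)|^2 -- the same value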
Some additional properties of the DTFT are listed in Table 2.1. Derivations and applications
of these properties may be found in references [7, 8,11].
[Table 2.1 lists DTFT properties; for example, a delay corresponds to
x(n - n₀) ↔ e^{-jωn₀} X(e^{jω}).]

The z-transform of a sequence x(n) is defined as

    X(z) = Σ_{n=-∞}^{∞} x(n) z^{-n}                                            (2.12)

and the DTFT is the z-transform evaluated on the unit circle,

    X(e^{jω}) = X(z)|_{z=e^{jω}} = Σ_{n=-∞}^{∞} x(n) e^{-jnω}
As with the DTFT, the z-transform is only defined when the sum in Eq. (2.12) converges.
Since the sum generally does not converge for all values of z, associated with each z-
transform is a region of convergence that defines those values of z for which the sum
converges. For a finite length sequence, the sum in Eq. (2.12) contains only a finite number
of terms. Therefore, the z-transform of a finite length sequence is a polynomial in z and the
region of convergence will include all values of z (except possibly z = 0 or z = ∞). For
right-sided sequences, on the other hand, the region of convergence is the exterior of a circle,
|z| > R₋, and for left-sided sequences it is the interior of a circle, |z| < R₊, where R₋ and
R₊ are positive numbers. For two-sided sequences, the region of convergence is an annulus,
R₋ < |z| < R₊.
Just as for the discrete-time Fourier transform, the z-transform has a number of important
and useful properties, some of which are summarized in Table 2.2. In addition, a symmetry
condition that will be of interest in our discussions of the power spectrum is the following.
If x(n) is a conjugate symmetric sequence,

    x(n) = x*(-n)

then its z-transform satisfies the relation

    X(z) = X*(1/z*)
This property follows by combining the conjugation and time-reversal properties listed in
Table 2.2. Finally, given in Table 2.3 are some closed-form expressions for some com-
monly found summations. These are often useful in evaluating the z-transform given in
Eq. (2.12).
A z-transform of special importance in the design and analysis of linear shift-invariant
systems is the system function, which is the z-transform of the unit sample response,

    H(z) = Σ_{n=-∞}^{∞} h(n) z^{-n}
[Table 2.3 lists closed-form expressions for commonly found summations, for example
Σ_{n=0}^{N-1} aⁿ = (1 - a^N)/(1 - a) and Σ_{n=0}^{∞} n aⁿ = a/(1 - a)², |a| < 1.]
For an FIR filter described by an LCCDE of the form given in Eq. (2.6), the system function
is a polynomial in z^{-1},

    H(z) = Σ_{k=0}^{q} b(k) z^{-k} = b(0) Π_{k=1}^{q} (1 - z_k z^{-1})         (2.14)
and the roots of this polynomial, z_k, are called the zeros of the filter. Due to the form of
H(z) for FIR filters, they are often referred to as all-zero filters. For an IIR filter described
by the general difference equation given in Eq. (2.5), the system function is a ratio of two
polynomials in z^{-1},

    H(z) = [ Σ_{k=0}^{q} b(k) z^{-k} ] / [ 1 + Σ_{k=1}^{p} a(k) z^{-k} ]
         = b(0) Π_{k=1}^{q} (1 - z_k z^{-1}) / Π_{k=1}^{p} (1 - p_k z^{-1})    (2.15)
The roots of the numerator polynomial, z_k, are the zeros of H(z) and the roots of the
denominator polynomial, p_k, are called the poles. If the order of the numerator polynomial
is zero, q = 0, then

    H(z) = b(0) / [ 1 + Σ_{k=1}^{p} a(k) z^{-k} ]                              (2.16)

and H(z) is called an all-pole filter.
If the coefficients a(k) and b(k) in the system function are real-valued (equivalently, if
h(n) is real-valued) then

    H(z) = H*(z*)

and the poles and zeros of H(z) will occur in conjugate pairs. That is, if H(z) has a pole
(zero) at z = a, then H(z) will have a pole (zero) at z = a*. Some useful z-transform pairs
may be found in Table 2.4.
[Table 2.4 lists common z-transform pairs; for example, aⁿu(n) ↔ 1/(1 - a z^{-1}), |z| > |a|.]
Linear phase filters are important in applications such as speech and image processing and
have a frequency response of the form

    H(e^{jω}) = A(e^{jω}) e^{j(β - αω)}                                        (2.17)

where A(e^{jω}) is a real-valued function of ω, and α and β are constants.³ In order for a causal
filter to have linear phase and be realizable using a finite-order linear constant coefficient
difference equation, the filter must be FIR [2, 7]. In addition, the linear phase condition
places the constraint that the unit sample response h(n) be either conjugate symmetric
(Hermitian),

    h*(n) = h(N - 1 - n)

or conjugate antisymmetric (anti-Hermitian),

    h*(n) = -h(N - 1 - n)

These constraints, in turn, impose the constraint that the zeros of the system function H(z)
occur in conjugate reciprocal pairs,

    H(z) = ±z^{-(N-1)} H*(1/z*)

In other words, if H(z) has a zero at z = z₀, then H(z) must also have a zero at z = 1/z₀*.
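This zero symmetry is easily verified numerically; the MATLAB sketch below (arbitrary coefficients, not an example from the text) shows that the zeros of a symmetric impulse response occur in conjugate reciprocal pairs:

    % Zeros of a real symmetric (linear phase) FIR filter occur in
    % conjugate reciprocal pairs: if z0 is a zero, so is 1/conj(z0).
    h = conv([1 -0.5], [1 -2]);            % h = [1 -2.5 1], so h(n) = h(N-1-n)
    roots(h)                               % zeros at z = 0.5 and z = 2 = 1/(0.5)

    g  = [1 -0.7071 0.25];                 % complex zeros at 0.5*exp(+/- j*pi/4)
    h2 = conv(g, fliplr(g));               % symmetric, hence linear phase
    z  = roots(h2);
    max(abs(polyval(h2, 1 ./ conj(z))))    % ~ 0: 1/conj(z0) is also a zero of H(z)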
Another filter having a special form is the allpass filter. Allpass filters are useful in
applications such as phase equalization and have a frequency response with a constant
magnitude,

    |H(e^{jω})| = constant

For an allpass filter having a rational system function, H(z) must be of the form

    H(z) = A z^{-n₀} Π_{k=1}^{N} (z^{-1} - α_k*) / (1 - α_k z^{-1})

Thus, if H(z) has a zero (pole) at z = α_k, then H(z) must also have a pole (zero) at
z = 1/α_k*.
³The term linear phase is often reserved for the special case in which β = 0 and A(e^{jω}) is nonnegative. Filters of
the form given in Eq. (2.17) are then said to have generalized linear phase. Here we will not be concerned about
this distinction.
Finally, a stable and causal filter that has a rational system function with all of its poles
and zeros inside the unit circle is said to be a minimum phase filter. Thus, for a minimum
phase filter, |z_k| < 1 and |p_k| < 1 in Eq. (2.15). Note that a minimum phase filter will have
a stable and causal inverse, 1/H(z), which will also be minimum phase. This property will
be useful in the development of the spectral factorization theorem in Section 3.5 and in the
derivation of the causal Wiener filter in Chapter 7.
[Figure: basic filter flowgraph elements: (a) an adder, (b) a multiplier with gain a, and
(c) a unit delay z^{-1}.]

[Figure: (a) direct form II filter structure for a 4th order IIR filter, with feedback
coefficients -a(1), -a(2), -a(3) and feedforward coefficients b(k).]
Note that the DFT of x(n) is equal to the discrete-time Fourier transform sampled at N
frequencies that are equally spaced between 0 and 2π,

    X(k) = X(e^{jω})|_{ω=2πk/N}                                                (2.19)
The Inverse Discrete Fourier Transform (IDFT),

    x(n) = (1/N) Σ_{k=0}^{N-1} X(k) e^{j2πkn/N}                                (2.20)
provides the relationship required to determine x(n) from the DFT coefficients X (k).
Recall that the product of the discrete-time Fourier transforms of two signals corre-
sponds, in the time domain, to the (linear) convolution of the two signals. For the DFT,
however, if H(k) and X(k) are the N-point DFTs of h(n) and x(n), respectively, and if
Y(k) = X(k)H(k), then

    y(n) = Σ_{k=0}^{N-1} x((k))_N h((n - k))_N

which is the N-point circular convolution of x(n) and h(n), where

    x((n))_N ≡ x(n mod N)
In general, circular convolution of two sequences is not the same as linear convolution.
However, there is a special and important case in which the two are the same. Specifically,
if x(n) is a finite-length sequence of length N₁ and h(n) is a finite-length sequence of length
N₂, then the linear convolution of x(n) and h(n) is of length L = N₁ + N₂ - 1. In this case,
the N-point circular convolution of x(n) and h(n) will be equal to the linear convolution
provided N ≥ L.
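This relationship may be demonstrated with a few lines of MATLAB (a sketch with arbitrary sequences): the inverse DFT of the product of two N-point DFTs matches conv(x, h) whenever N ≥ N₁ + N₂ - 1, and exhibits time-domain aliasing when it does not.

    % Linear vs. circular convolution using the DFT.
    x = randn(1, 20);                      % length N1 = 20
    h = randn(1, 8);                       % length N2 = 8
    L = length(x) + length(h) - 1;         % linear convolution length, L = 27

    y_lin  = conv(x, h);
    y_circ = ifft(fft(x, 32) .* fft(h, 32));   % 32-point circular convolution

    max(abs(y_circ(1:L) - y_lin))          % ~ 0, since N = 32 >= L

    y_wrap = ifft(fft(x, 20) .* fft(h, 20));   % N = 20 < L: time-domain aliasing
    max(abs(y_wrap - y_lin(1:20)))         % no longer ~ 0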
In using the DFT to perform convolutions or as a method for evaluating samples of
the discrete-time Fourier transform in real-time applications, it is useful to know what the
computational requirements would be to find the DFT. If the DFT is evaluated directly
using Eq. (2.19), then N complex multiplications and N - 1 complex additions would be
required for each value of k. Therefore, an N-point DFT requires a total of N² complex
multiplications and N² - N complex additions.⁴ This number may be reduced significantly,
however, by employing any one of a number of Fast Fourier Transform (FFT) algorithms [7].
For example, if N is a power of 2, N = 2^ν, a radix-2 decimation-in-time FFT algorithm
requires approximately (N/2) log₂ N complex multiplications and N log₂ N complex additions.
For large N, this represents a significant reduction.
In many of the mathematical developments that will be encountered in later chapters, it will
be convenient to use vector and matrix notation for the representation of signals and the
operations that are performed on them. Such a representation will greatly simplify many of
the mathematical expressions from a notational point of view and allow us to draw upon
⁴Since we are not taking into account the fact that some of these multiplications are by ±1, the actual number is
a bit smaller than this.
many useful results from linear algebra to simplify or solve these expressions. Although
it will not be necessary to delve too deeply into the theory of linear algebra, it will be
important to become familiar with the basic tools of vector and matrix analysis. Therefore,
in this section we summarize the results from linear algebra that will prove to be useful in
this book.
2.3.1 Vectors
A vector is an array of real-valued or complex-valued numbers or functions. Vectors will
be denoted by lowercase bold letters, such as x, a, and v and, in all cases, these vectors
will be assumed to be column vectors. For example,

    x = [x₁, x₂, ..., x_N]ᵀ

is a column vector containing N scalars. If the elements of x are real, then x is said to be
a real vector, whereas if they are complex, then x is said to be a complex vector. A vector
having N elements is referred to as an N-dimensional vector. The transpose of a vector, xᵀ,
is a row vector,

    xᵀ = [x₁, x₂, ..., x_N]
Vectors are useful for representing the values of a discrete-time signal in a concise way. For
example, a finite length sequence x(n) that is equal to zero outside the interval [0, N - 1]
may be represented in vector form as

    x = [x(0), x(1), ..., x(N - 1)]ᵀ                                           (2.21)

It is also convenient, in some cases, to consider the set of vectors, x(n), that contain the
signal values x(n), x(n - 1), ..., x(n - N + 1),

    x(n) = [x(n), x(n - 1), ..., x(n - N + 1)]ᵀ                                (2.22)
Thus, x(n) is a vector of N elements that is parameterized by the time index n.
Many of the operations that we will be performing on vectors will involve finding the
magnitude of a vector. In order to find the magnitude, however, it is necessary to define a
norm or distance metric. The Euclidean or ℓ₂ norm is one such measure that, for a vector
x of dimension N, is

    ||x||₂ = ( Σ_{i=1}^{N} |xᵢ|² )^{1/2}                                       (2.23)

Other norms are sometimes used, such as the ℓ₁ norm,

    ||x||₁ = Σ_{i=1}^{N} |xᵢ|

In this book, since the ℓ₂ norm will be used almost exclusively, the vector norm will be
denoted simply by ||x|| and will be interpreted as the ℓ₂ norm unless indicated otherwise.
A vector may be normalized to have unit magnitude by dividing by its norm. For
example, assuming that ||x|| ≠ 0, then

    vx = x / ||x||

is a unit norm vector that lies in the same direction as x. For a vector whose elements are
signal values, x(n), the norm squared represents the energy in the signal. For example, the
squared norm of the vector x in Eq. (2.21) is

    ||x||² = Σ_{n=0}^{N−1} |x(n)|²
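A brief Python/NumPy illustration of these ideas (example values assumed): the ℓ2 norm of Eq. (2.23), normalization to unit norm, and the interpretation of the squared norm as signal energy.

    import numpy as np

    x = np.array([3.0, 4.0])
    norm = np.linalg.norm(x)        # Euclidean (l2) norm: sqrt(3^2 + 4^2) = 5
    v = x / norm                    # unit norm vector in the same direction as x
    print(norm, v, norm**2)         # 5.0 [0.6 0.8] 25.0  (norm squared = energy)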
In addition to providing a metric for the length of a vector, the norm may also be used to
measure the distance between two vectors,

    d(a, b) = ||a − b||

Given two real vectors a and b, the inner product is the scalar

    (a, b) = a^T b = Σ_{i=1}^{N} ai bi

The inner product defines the geometrical relationship between two vectors. This relation-
ship is given by

    (a, b) = ||a|| ||b|| cos θ

where θ is the angle between the two vectors. Thus, two nonzero vectors a and b are said
to be orthogonal if their inner product is zero
(a, b) = 0
Two vectors that are orthogonal and have unit norm are said to be orthonormal.
Note that since |cos θ| ≤ 1, the inner product between two vectors is bounded by the
product of their magnitudes,

    |(a, b)| ≤ ||a|| ||b||        (2.25)

where equality holds if and only if a and b are collinear, i.e., a = αb for some constant α.
Equation (2.25) is known as the Cauchy-Schwarz inequality. Another useful inequality is
the following,

    2 |(a, b)| ≤ ||a||² + ||b||²

with equality holding if and only if a = ±b. This inequality may be established by noting
that, for any two vectors a and b,

    ||a ± b||² ≥ 0

Expanding the norm we have

    ||a ± b||² = ||a||² ± 2(a, b) + ||b||² ≥ 0

from which the inequality easily follows.
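The two inequalities are easy to check numerically; the following Python/NumPy fragment (with assumed example vectors) verifies the Cauchy-Schwarz inequality of Eq. (2.25) and recovers the angle θ from the inner product.

    import numpy as np

    a = np.array([1.0, 2.0, -1.0])
    b = np.array([0.5, -1.0, 2.0])

    inner = np.dot(a, b)
    # Cauchy-Schwarz: |(a, b)| <= ||a|| ||b||
    print(abs(inner) <= np.linalg.norm(a) * np.linalg.norm(b))              # True
    # 2|(a, b)| <= ||a||^2 + ||b||^2
    print(2 * abs(inner) <= np.linalg.norm(a)**2 + np.linalg.norm(b)**2)    # True
    # angle between the vectors, from (a, b) = ||a|| ||b|| cos(theta)
    theta = np.arccos(inner / (np.linalg.norm(a) * np.linalg.norm(b)))
    print(np.degrees(theta))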
One of the uses of the inner product is to represent, in a concise way, the output of a
linear shift-invariant filter. For example, let h(n) be the unit sample response of an FIR filter
of order N − 1 and let x(n) be the filter input. The filter output is the convolution of h(n)
and x(n),

    y(n) = Σ_{k=0}^{N−1} h(k) x(n − k)
Therefore, expressing x(n) in vector form using Eq. (2.22) and writing h(n) in vector form
as

    h = [h(0), h(1), ..., h(N − 1)]^T

the output y(n) may be written as the inner product

    y(n) = h^T x(n)
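As a sketch in Python/NumPy (the signal, filter, and helper x_of are assumptions used only for illustration), the sample y(n) produced by the convolution sum is the same number obtained from the inner product h^T x(n).

    import numpy as np

    h = np.array([1.0, 0.5, 0.25])                     # unit sample response, order N-1 = 2
    x_sig = np.array([1.0, -1.0, 2.0, 0.0, 3.0])

    def x_of(n, N, signal):
        # [x(n), x(n-1), ..., x(n-N+1)]^T with zeros outside the signal support
        idx = n - np.arange(N)
        return np.array([signal[i] if 0 <= i < len(signal) else 0.0 for i in idx])

    n = 3
    y_n = h @ x_of(n, len(h), x_sig)                   # inner product form of the output
    print(np.isclose(y_n, np.convolve(h, x_sig)[n]))   # True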
for some set of scalars, βi. For vectors of dimension N, no more than N vectors may be
linearly independent, i.e., any set containing more than N vectors will always be linearly
dependent.

it is easily shown that v1 and v2 are linearly independent. Specifically, note that if we are to
find values for the scalars α and β such that

The only solution to these equations is α = β = 0. Therefore, the vectors are linearly
independent.
Given a set of vectors, v1, v2, ..., vN, consider the set of all vectors V that may be formed
from a linear combination of the vectors vi,

    v = Σ_{i=1}^{N} αi vi
This set forms a vector space, and the vectors vi are said to span the space V. Furthermore,
if the vectors vi are linearly independent, then they are said to form a basis for the space
V, and the number of vectors in the basis, N, is referred to as the dimension of the space.
For example, the set of all real vectors of the form x = [x1, x2, ..., xN]^T forms an N-
dimensional vector space, denoted by R^N, that is spanned by the basis vectors

    u1 = [1, 0, 0, ..., 0]^T
    u2 = [0, 1, 0, ..., 0]^T
    ...
    uN = [0, 0, 0, ..., 1]^T

In terms of this basis, any vector v = [v1, v2, ..., vN]^T in R^N may be written as

    v = Σ_{i=1}^{N} vi ui
It should be pointed out, however, that the basis for a vector space is not unique.
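Linear independence is easy to test numerically: a set of vectors is linearly independent exactly when the matrix having them as columns has rank equal to the number of vectors. The following Python/NumPy fragment (with assumed example vectors) illustrates this.

    import numpy as np

    v1 = np.array([1.0, 0.0, 1.0])
    v2 = np.array([0.0, 1.0, 1.0])
    v3 = v1 + v2                                       # deliberately dependent on v1 and v2

    print(np.linalg.matrix_rank(np.column_stack([v1, v2])))        # 2 -> independent
    print(np.linalg.matrix_rank(np.column_stack([v1, v2, v3])))    # 2 -> dependent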
2.3.3 Matrices
An n x m matrix is an array of numbers (real or complex) or functions having n rows and
m columns. For example,

    A = {aij} = [ a11  a12  a13  ...  a1m ]
                [ a21  a22  a23  ...  a2m ]
                [  .    .    .         .  ]
                [ an1  an2  an3  ...  anm ]

is an n x m matrix of numbers aij; a matrix of functions, such as A(z) = {aij(z)}, is written
in the same way.
X0 h = y

where X0 is a convolution matrix defined by⁵

    X0 = [ x(0)      0        0      ...    0   ]
         [ x(1)     x(0)      0      ...    0   ]
         [ x(2)     x(1)     x(0)    ...    0   ]
         [   .        .        .            .   ]
         [ x(N−1)   x(N−2)   x(N−3)  ...  x(0)  ]
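A quick Python sketch of this convolution matrix (using scipy.linalg.toeplitz; the signal and filter values are assumed): X0 has x(0), x(1), ..., x(N − 1) in its first column and zeros to the right of x(0) in its first row, so that X0 h reproduces the first N samples of the convolution.

    import numpy as np
    from scipy.linalg import toeplitz

    x_sig = np.array([1.0, 2.0, 3.0, 4.0])              # x(0), ..., x(3)
    h = np.array([1.0, -1.0, 0.5])

    X0 = toeplitz(x_sig, np.r_[x_sig[0], np.zeros(len(h) - 1)])      # 4 x 3 convolution matrix
    print(np.allclose(X0 @ h, np.convolve(x_sig, h)[:len(x_sig)]))   # True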
A matrix may also be partitioned into submatrices. For example, an n x m matrix A may
be partitioned as
    A = [ A11  A12 ]
        [ A21  A22 ]

where A11 is p x q, A12 is p x (m − q), A21 is (n − p) x q, and A22 is (n − p) x (m − q).
⁵The subscript on X0 is used to indicate the value of the index of x(n) in the first entry of the convolution matrix.
For example, a 3 x 3 matrix may be partitioned in the block form

    A = [ 1   0^T ]
        [ 0   A22 ]

where 0 = [0, 0]^T and

    A22 = [ 2  2 ]
          [ 2  2 ]

is a 2 x 2 matrix.
then A is said to be a symmetric matrix. For complex matrices, the Hermitian transpose is
the complex conjugate of the transpose of A and is denoted by A^H. Thus, if

    A = A^H

then the matrix is said to be Hermitian. A few properties of the Hermitian transpose are

1. (A + B)^H = A^H + B^H

2. (A^H)^H = A

3. (AB)^H = B^H A^H

Equivalent properties for the transpose may be obtained by replacing the Hermitian transpose
H with the transpose T.
then the rank of A is equivalently equal to the number of linearly independent row vectors,
i.e., the number of linearly independent vectors in the set {r1, r2, ..., rn}. A useful property
of the rank of a matrix is the following:

    ρ(A) = ρ(A A^H) = ρ(A^H A)
Since the rank of a matrix is equal to the number of linearly independent rows and the
number of linearly independent columns, it follows that if A is an m x n matrix then
    ρ(A) ≤ min(m, n)

If A is an m x n matrix and ρ(A) = min(m, n), then A is said to be of full rank. If A is a
square matrix of full rank, then there exists a unique matrix A^-1, called the inverse of A,
such that

    A^-1 A = A A^-1 = I
where

    I = [ 1  0  0  ...  0 ]
        [ 0  1  0  ...  0 ]
        [ 0  0  1  ...  0 ]
        [ .  .  .       . ]
        [ 0  0  0  ...  1 ]

is the identity matrix, which has ones along the main diagonal and zeros everywhere else.
In this case A is said to be invertible or nonsingular. If A is not of full rank, ρ(A) < n, then
it is said to be noninvertible or singular, and A does not have an inverse. Some properties
of the matrix inverse are as follows. First, if A and B are invertible, then the inverse of their
product is

    (AB)^-1 = B^-1 A^-1

Second, the Hermitian transpose of the inverse is equal to the inverse of the Hermitian
transpose,

    (A^H)^-1 = (A^-1)^H
Finally, a formula that will be useful for inverting matrices that arise in adaptive filtering
problems is the matrix inversion lemma,

    (A + BCD)^-1 = A^-1 − A^-1 B (C^-1 + D A^-1 B)^-1 D A^-1        (2.28)

2.3.5 The Determinant and the Trace

The determinant of an n x n matrix A may be defined recursively in terms of the determinants
of (n − 1) x (n − 1) matrices. Specifically, for any j,

    det(A) = Σ_{i=1}^{n} (−1)^(i+j) aij det(Aij)        (2.30)
where Aij is the (n - 1) x (n - 1) matrix that is formed by deleting the ith row and the jth
column of A.
For example, for a 2 x 2 matrix

    A = [ a11  a12 ]
        [ a21  a22 ]

the determinant is

    det(A) = a11 a22 − a12 a21

and for a 3 x 3 matrix the determinant is

    det(A) = a11 (a22 a33 − a23 a32) − a12 (a21 a33 − a23 a31) + a13 (a21 a32 − a22 a31)
The determinant may be used to determine whether or not a matrix is invertible. Specifically,
an n x n matrix A is invertible if and only if its determinant is nonzero, det(A) ≠ 0.
Some additional properties of the determinant are listed below. It is assumed that A and B
are n x n matrices.
1. det(AB) = det(A) det(B)

2. det(A^T) = det(A)

3. det(αA) = α^n det(A), where α is a constant

4. det(A^-1) = 1 / det(A), assuming that A is invertible
Another function of a matrix that will occasionally be useful is the trace. Given an n x n
matrix, A, the trace is the sum of the terms along the diagonal,

    tr(A) = Σ_{i=1}^{n} aii
2.3.6 Linear Equations
Many problems addressed in later chapters, such as signal modeling, Wiener filtering, and
spectrum estimation, require finding the solution or solutions to a set of linear equations.
Many techniques are available for solving linear equations and, depending on the form of
the equations, there may be "fast algorithms" for efficiently finding the solution. In addition
to solving the linear equations, however, it is often important to characterize the form of the
solution in terms of existence and uniqueness. Therefore, this section briefly summarizes
the conditions under which a unique solution exists, discusses how constraints might be
imposed if multiple solutions exist, and indicates how an approximate solution may be
obtained if no solution exists.
Consider the following set of n linear equations in the m unknowns xi, i = 1, 2, ..., m,
written in matrix form as

    Ax = b        (2.31)
As discussed below, solving Eq. (2.31) depends upon a number of factors including the
relative size of m and n, the rank of the matrix A, and the elements in the vector b. The case
of a square matrix (m = n) is considered first.
Clearly, the matrix A is singular, det(A) = 0, and no solution exists. However, if the second
equation is modified so that
then there are many solutions. Specifically, note that for any constant α, the vector
For the case in which A is singular, the columns of A are linearly dependent and there exist
nonzero solutions to the homogeneous equations

    Az = 0        (2.33)
In fact, there will be k = n − ρ(A) linearly independent solutions to the homogeneous
equations. Therefore, if there is at least one vector, x0, that solves Eq. (2.31), then any
vector of the form

    x = x0 + α1 z1 + α2 z2 + ... + αk zk

will also be a solution, where zi, i = 1, 2, ..., k, are linearly independent solutions of
Eq. (2.33).
Rectangular Matrix: n < m. If n < m, then there are fewer equations than unknowns.
Therefore, provided the equations are not inconsistent, there are many vectors that satisfy
the equations, i.e., the solution is underdetermined or incompletely specified. One approach
that is often used to define a unique solution is to find the vector satisfying the equations
that has the minimum norm, i.e.,

    min ||x||   such that   Ax = b

If the rank of A is n (the rows of A are linearly independent), then the n x n matrix AA^H
is invertible and the minimum norm solution is [10]

    x0 = A^H (A A^H)^-1 b        (2.34)

The matrix

    A^+ = A^H (A A^H)^-1

is known as the pseudo-inverse of the matrix A for the underdetermined problem. The
following example illustrates how the pseudoinverse is used to solve an underdetermined
set of linear equations.
    x1 − x2 + x3 − x4 = 1        (2.35)

This equation may be written in matrix form as

    Ax = [1, −1, 1, −1] x = b

where b = 1 and x is the vector containing the unknowns xi. Clearly, the solution is
incompletely specified since there are many solutions that satisfy this equation. However,
the minimum norm solution is unique and given by Eq. (2.34). Specifically, since

    A A^H = 4

and (A A^H)^-1 = 1/4, then the minimum norm solution is

    x0 = A^H (A A^H)^-1 b = (1/4) [1, −1, 1, −1]^T
If, on the other hand, a second equation is added,

    x1 + x2 + x3 + x4 = 1

then there are two equations in four unknowns with

    A = [ 1  −1  1  −1 ]
        [ 1   1  1   1 ]

and b = [1, 1]^T. Again, it is easy to see that the solution is incompletely specified. Since

    A A^H = [ 4  0 ]
            [ 0  4 ]

then (A A^H)^-1 = (1/4) I and the minimum norm solution is

    x0 = A^H (A A^H)^-1 b = (1/4) A^H b = [1/2, 0, 1/2, 0]^T
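The first part of this example is easy to reproduce numerically. The Python/NumPy sketch below forms the minimum norm solution of Eq. (2.34) for the single equation (2.35) and checks it against numpy's pseudo-inverse.

    import numpy as np

    A = np.array([[1.0, -1.0, 1.0, -1.0]])
    b = np.array([1.0])

    x0 = A.conj().T @ np.linalg.solve(A @ A.conj().T, b)   # Eq. (2.34)
    print(x0)                                              # [ 0.25 -0.25  0.25 -0.25]
    print(np.allclose(A @ x0, b))                          # True: the equation is satisfied
    print(np.allclose(x0, np.linalg.pinv(A) @ b))          # pinv gives the same solution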
Rectangular Matrix: m < n. If m < n, then there are more equations than unknowns
and, in general, no solution exists. In this case, the equations are inconsistent and the solution
is said to be overdetermined. The geometry of this problem is illustrated in Fig. 2.5 for the
case of three equations in two unknowns. Since an arbitrary vector b cannot be represented
in terms of a linear combination of the columns of A as given in Eq. (2.32), the goal is to
find the coefficients xi that produce the best approximation to b,

    b̂ = Σ_{i=1}^{m} xi ai
The approach that is commonly used in this situation is to find the least squares solution,
i.e., the vector x that minimizes the norm of the error,

    ||e||² = ||b − Ax||²        (2.36)

The least squares solution has the property that the error, e = b − Ax, is orthogonal to the
columns of A,

    A^H (b − Ax) = 0        (2.37)

or,

    A^H A x = A^H b        (2.38)

which are known as the normal equations [10]. If the columns of A are linearly independent
(A has full rank), then the matrix A^H A is invertible and the least squares solution is

    x0 = (A^H A)^-1 A^H b        (2.39)

or, x0 = A^+ b, where

    A^+ = (A^H A)^-1 A^H

is the pseudo-inverse of the matrix A for the overdetermined problem. Furthermore, the best
approximation b̂ to b is given by the projection of the vector b onto the subspace spanned
by the vectors ai,

    b̂ = A x0 = A (A^H A)^-1 A^H b        (2.40)

or

    b̂ = PA b

where

    PA = A (A^H A)^-1 A^H        (2.41)

is called the projection matrix. Finally, expanding the square in Eq. (2.36) and using the
orthogonality condition given in Eq. (2.37), it follows that the minimum least squares error
is

    min ||e||² = b^H (b − A x0) = b^H b − b^H A x0        (2.42)
The following example illustrates the use of the pseudoinverse to solve an overdetermined
set of linear equations.
Since the rank of A is two, the least squares solution is unique and may be found from
Eq. (2.39). The error, e, is

    e = b − A x0 = (1/11) [−1, −1, 3]^T

and the least squares error is

    ||e||² = 1/11
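The least squares machinery is summarized in the following Python/NumPy sketch (the matrix A and vector b are assumed example data, not those of the text): the normal equations of Eq. (2.38), the projection matrix of Eq. (2.41), and the orthogonality of the error to the columns of A.

    import numpy as np

    A = np.array([[1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0]])
    b = np.array([1.0, 2.0, 2.0])

    x0 = np.linalg.solve(A.conj().T @ A, A.conj().T @ b)    # solves the normal equations
    P_A = A @ np.linalg.solve(A.conj().T @ A, A.conj().T)   # projection onto the columns of A
    e = b - P_A @ b                                         # residual b - b_hat
    print(np.allclose(x0, np.linalg.lstsq(A, b, rcond=None)[0]))   # True
    print(np.allclose(A.conj().T @ e, 0))                   # error orthogonal to columns of A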
2.3.7 Special Matrix Forms

A diagonal matrix is a square matrix in which all of the entries off the main diagonal are
equal to zero,

    A = [ a11   0    0   ...   0  ]
        [  0   a22   0   ...   0  ]
        [  0    0   a33  ...   0  ]        (2.43)
        [  .    .    .         .  ]
        [  0    0    0   ...  ann ]

A diagonal matrix will also be written as A = diag{a11, a22, ..., ann}. For example, the
identity matrix is

    I = diag{1, 1, ..., 1} = [ 1  0  0  ...  0 ]
                             [ 0  1  0  ...  0 ]
                             [ 0  0  1  ...  0 ]
                             [ .  .  .       . ]
                             [ 0  0  0  ...  1 ]
If the entries along the diagonal in Eq. (2.43) are replaced with matrices,

    A = [ A11   0   ...   0  ]
        [  0   A22  ...   0  ]
        [  .    .         .  ]
        [  0    0   ...  Akk ]

then A is said to be a block diagonal matrix. The matrix in Example 2.3.3 is an example of
a 3 x 3 block diagonal matrix.
Another matrix that will be useful is the exchange matrix,

    J = [ 0  0  0  1 ]
        [ 0  0  1  0 ]
        [ 0  1  0  0 ]
        [ 1  0  0  0 ]
which is symmetric and has ones along the cross-diagonal and zeros everywhere else. Note
that since J² = I, then J is its own inverse. The effect of multiplying a vector
v = [v1, v2, ..., vN]^T by the exchange matrix is to reverse the order of the entries, i.e.,

    Jv = [vN, vN−1, ..., v1]^T

Similarly, if a matrix A is multiplied on the left by the exchange matrix, the effect is to
reverse the order of the entries in each column, whereas if A is multiplied on the right by J,
the order of the entries in each row is reversed.
An upper triangular matrix is a square matrix in which all of the terms below the
diagonal are equal to zero, i.e., with A = {aij}, then aij = 0 for i > j. The following is an
example of a 4 x 4 upper triangular matrix,

    A = [ a11  a12  a13  a14 ]
        [  0   a22  a23  a24 ]
        [  0    0   a33  a34 ]
        [  0    0    0   a44 ]
A lower triangular matrix, on the other hand, is a square matrix in which all of the entries
above the diagonal are equal to zero, i.e., aij = 0 for i < j. Clearly, the transpose of a
lower triangular matrix is upper triangular and vice versa. Some additional properties of
lower and upper triangular matrices are as follows.
1. The determinant of a lower triangular matrix or an upper triangular matrix is equal
to the product of the terms along the diagonal,

    det(A) = Π_{i=1}^{n} aii
A Toeplitz matrix is a square matrix in which all of the entries along each diagonal are
equal, i.e., aij = a_{i+1, j+1} for all i, j.
Note that all of the entries in a Toeplitz matrix are completely defined once the first column
and the first row have been specified. A convolution matrix is an example of a Toeplitz
matrix. A matrix with a similar property is a Hankel matrix, which has equal elements
along the diagonals that are perpendicular to the main diagonal, i.e.,
    aij = a_{i+1, j−1}

for all i < n and j ≤ n
An example of a 4 x 4 Hankel matrix is one of the form

    A = [ a1  a2  a3  a4 ]
        [ a2  a3  a4  a5 ]
        [ a3  a4  a5  a6 ]
        [ a4  a5  a6  a7 ]

Another example of a Hankel matrix is the exchange matrix J.
Toeplitz matrices are a special case of a larger class of matrices known as persymmetric
matrices. A persymmetric matrix is symmetric about the cross-diagonal, aij = a_{n−j+1, n−i+1}.
An example of a persymmetric matrix is

    A = [ 2  2  4 ]
        [ 4  4  2 ]
        [ 6  4  2 ]
If a Toeplitz matrix is symmetric, or Hermitian in the case of a complex matrix, then all
of the elements of the matrix are completely determined by either the first row or the first
column of the matrix. An example of a symmetric Toeplitz matrix is
    A = [ 1  3  5 ]
        [ 3  1  3 ]
        [ 5  3  1 ]
Since we will be dealing frequently with symmetric Toeplitz and Hermitian Toeplitz ma-
trices it will be convenient to adopt the notation
    A = Toep{a(0), a(1), ..., a(p)}

for the Hermitian Toeplitz matrix that has the elements a(0), a(1), ..., a(p) in the first
column. For example,

    A = Toep{1, j, 1 − j} = [ 1      −j     1 + j ]
                            [ j       1     −j    ]
                            [ 1 − j   j      1    ]
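A Hermitian Toeplitz matrix in the Toep{·} notation is easily built with scipy.linalg.toeplitz, which takes the first column and first row; the following Python sketch reproduces the matrix Toep{1, j, 1 − j} above.

    import numpy as np
    from scipy.linalg import toeplitz

    a = np.array([1.0, 1j, 1 - 1j])                    # a(0), a(1), a(2): first column
    A = toeplitz(a, a.conj())                          # first row is the conjugate
    print(A)
    print(np.allclose(A, A.conj().T))                  # True: the matrix is Hermitian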
Symmetric Toeplitz matrices are a special case of a larger class of matrices known as
centrosymmetric matrices. A centrosymmetric matrix is both symmetric and persymmetric.
An example of a centrosymmetric matrix that is not Toeplitz is

    A = [ 1  2  3 ]
        [ 2  4  2 ]
        [ 3  2  1 ]
There are many interesting and useful properties of Toeplitz, persymmetric, and centrosym-
metric matrices. For example, if A is a symmetric Toeplitz matrix, then
    J^T A J = A

whereas if A is a Hermitian Toeplitz matrix, then

    J^T A J = A*
Of particular interest will be the following properties that are concerned with the inverse.
• Property 1. The inverse of a symmetric matrix is symmetric.
• Property 2. The inverse of a persymmetric matrix is persymmetric.
• Property 3. The inverse of a Toeplitz matrix is not, in general, Toeplitz. However,
since a Toeplitz matrix is persymmetric, the inverse will always be persymmetric.
Furthermore, the inverse of a symmetric Toeplitz matrix will be centrosymmetric.
These symmetry properties of matrix inverses along with those previously mentioned are
summarized in Table 2.5.
Finally, we conclude with the definition of orthogonal and unitary matrices. A real
n x n matrix is said to be orthogonal if the columns (and rows) are orthonormal. Thus, if
the columns of A are ai, then

    ai^T aj = 1  for i = j,    ai^T aj = 0  for i ≠ j

In this case, A^T A = I and the inverse of A is equal to its transpose, A^-1 = A^T. Similarly,
a complex n x n matrix is said to be unitary if its columns (and rows) are orthonormal,

    ai^H aj = 1  for i = j,    ai^H aj = 0  for i ≠ j

In this case,

    A^H A = I

and A is said to be a unitary matrix. The inverse of a unitary matrix is equal to its Hermitian
transpose,

    A^-1 = A^H
2.3.8 Quadratic and Hermitian Forms

The quadratic form of an n x n real symmetric matrix A is the scalar defined by

    QA(x) = x^T A x = Σ_{i=1}^{n} Σ_{j=1}^{n} xi aij xj

where x^T = [x1, x2, ..., xn] is a vector of n real variables. Note that the quadratic form is
a quadratic function in the n variables x1, x2, ..., xn. For example, the quadratic form of
    A = [ 3  1 ]
        [ 1  2 ]

is

    QA(x) = x^T A x = 3 x1² + 2 x1 x2 + 2 x2²
In a similar fashion, for a Hermitian matrix, the Hermitian form is defined as

    QA(x) = x^H A x = Σ_{i=1}^{n} Σ_{j=1}^{n} xi* aij xj
If the quadratic form QA(x) is positive for all nonzero vectors x, then A is said to be positive
definite, written A > 0, and if QA(x) ≥ 0 for all nonzero x, then A is said to be positive
semidefinite, written A ≥ 0. For example, the matrix

    A = [ 2  0 ]
        [ 0  0 ]

which has the quadratic form QA(x) = 2 x1², is positive semidefinite since QA(x) = 2 x1² ≥ 0,
but it is not positive definite since QA(x) = 0 for any vector x of the form x = [0, x2]^T.
A test for positive definiteness is given in
Section 2.3.9 (see Property 3 on p. 43).
In a similar fashion, a matrix is said to be negative definite if QA(x) < 0 for all nonzero
x, whereas it is said to be negative semidefinite if QA(x) ≤ 0 for all nonzero x. A matrix
that is none of the above is said to be indefinite.
For any n x n matrix A and for any n x m matrix B having full rank, the definiteness of
A and B^H A B will be the same. For example, if A > 0 and B is full rank, then B^H A B > 0.
This follows by noting that for any vector x,

    x^H (B^H A B) x = (Bx)^H A (Bx) = v^H A v

where v = Bx. Therefore, if A > 0, then v^H A v > 0 and B^H A B is positive definite. (The
constraint that B have full rank ensures that v = Bx is nonzero for any nonzero vector x.)
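A small numerical check of this property (Python/NumPy sketch, example matrices assumed): the eigenvalues of A and of B^H A B are all positive when A > 0 and B has full rank.

    import numpy as np

    A = np.array([[3.0, 1.0], [1.0, 2.0]])             # positive definite
    B = np.array([[1.0, 2.0], [0.0, 1.0]])             # full rank
    C = B.conj().T @ A @ B

    print(np.linalg.eigvalsh(A) > 0)                    # [ True  True ]
    print(np.linalg.eigvalsh(C) > 0)                    # [ True  True ]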
2.3.9 Eigenvalues and Eigenvectors

The eigenvalues of a matrix may be used, for example, to determine whether or not a matrix
is invertible, as well as to indicate how sensitive the determination of the inverse will be to
numerical errors. Eigenvalues and eigenvectors also provide an important representation for
matrices known as the eigenvalue decomposition. This decomposition, developed below,
will be useful in our discussions of spectrum estimation as well as in our study of the con-
vergence properties of adaptive filtering algorithms. This section begins with the definition
of the eigenvalues and eigenvectors of a matrix and then proceeds to review some properties
of eigenvalues and eigenvectors that will be useful in later chapters.
Let A be an n x n matrix and consider the following set of linear equations,
    Av = λv        (2.44)
where λ is a constant. Equation (2.44) may equivalently be expressed as a set of homoge-
neous linear equations of the form
    (A − λI) v = 0        (2.45)
In order for a nonzero vector to be a solution to this equation, it is necessary for the matrix
A − λI to be singular. Therefore, the determinant of A − λI must be zero,

    p(λ) = det(A − λI) = 0        (2.46)

Note that p(λ) is an nth-order polynomial in λ. This polynomial is called the characteristic
polynomial of the matrix A and the n roots, λi for i = 1, 2, ..., n, are the eigenvalues of
A. For each eigenvalue, λi, the matrix (A − λi I) will be singular and there will be at least
one nonzero vector, vi, that solves Eq. (2.44), i.e.,

    A vi = λi vi        (2.47)

These vectors, vi, are called the eigenvectors of A. For any eigenvector vi, it is clear that
αvi will also be an eigenvector for any constant α. Therefore, eigenvectors are often nor-
malized so that they have unit norm, ||vi|| = 1. The following property establishes the linear
independence of eigenvectors that correspond to distinct eigenvalues.
Property 1. The nonzero eigenvectors v1, v2, ..., vn corresponding to distinct eigen-
values λ1, λ2, ..., λn are linearly independent.
In general, it is not possible to say much about the eigenvalues and eigenvectors of a
matrix without knowing something about its properties. However, in the case of symmetric
or Hermitian matrices, the eigenvalues and eigenvectors have several useful and important
properties. For example,

Property 2. The eigenvalues of a Hermitian matrix are real.

This property is easily established as follows. Let A be a Hermitian matrix with eigenvalue
λi and eigenvector vi,

    A vi = λi vi

Multiplying this equation on the left by vi^H,

    vi^H A vi = λi vi^H vi        (2.49)

and taking the Hermitian transpose gives

    vi^H A^H vi = λi* vi^H vi        (2.50)

Since A is Hermitian, A^H = A, it follows that λi = λi* and the eigenvalue λi is therefore real.
Property 3. A Hermitian matrix is positive definite, A > 0, if and only if the eigen-
values of A are positive, λk > 0.

Similar properties hold for positive semidefinite, negative definite, and negative semidefinite
matrices. For example, if A is Hermitian and A ≥ 0, then λk ≥ 0.
The determinant of a matrix is related to its eigenvalues by the following relationship,
    det(A) = Π_{i=1}^{n} λi        (2.52)
Therefore, a matrix is nonsingular (invertible) if and only if all of its eigenvalues are nonzero.
Conversely, if a matrix has one or more zero eigenvalues then it will be singular (noninvert-
ible). As a result, it follows that any positive definite matrix is nonsingular.
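This relationship is easy to confirm numerically; the Python/NumPy fragment below (with an assumed example matrix) compares the product of the eigenvalues with the determinant.

    import numpy as np

    A = np.array([[2.0, 1.0], [1.0, 3.0]])
    lam = np.linalg.eigvals(A)
    print(np.isclose(np.prod(lam), np.linalg.det(A)))   # True: det(A) = product of eigenvalues
    print(np.all(np.abs(lam) > 0))                      # all eigenvalues nonzero -> invertible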
In Property 2 we saw that the eigenvalues of a Hermitian matrix are real. The next prop-
erty establishes the orthogonality of the eigenvectors corresponding to distinct eigenvalues.
To establish this property, let λi and λj be two distinct eigenvalues of a Hermitian matrix
with eigenvectors vi and vj, respectively,

    A vi = λi vi
    A vj = λj vj        (2.53)

Multiplying the first equation on the left by vj^H and the second by vi^H gives

    vj^H A vi = λi vj^H vi
    vi^H A vj = λj vi^H vj

Taking the Hermitian transpose of the second equation and using the fact that A is Hermitian
and that λj is real, it follows that

    (λi − λj) vj^H vi = 0

Therefore, if λi ≠ λj, then vj^H vi = 0 and it follows that vi and vj are orthogonal. •
Although stated and verified only for the case of distinct eigenvalues, it is also true that
for any n x n Hermitian matrix there exists a set of n orthonormal eigenvectors [10]. For
example, consider the 2 x 2 identity matrix,

    I = [ 1  0 ]
        [ 0  1 ]

which has two eigenvalues that are equal to one. Since any vector is an eigenvector of I,
then v1 = [1, 0]^T and v2 = [0, 1]^T is one possible set of orthonormal eigenvectors.
For any n x n matrix A having a set of n linearly independent eigenvectors we may
perform an eigenvalue decomposition that expresses A in the form

    A = V Λ V^-1        (2.58)

where V is a matrix that contains the eigenvectors of A and Λ is a diagonal matrix that
contains the eigenvalues. This decomposition is performed as follows. Let A be an n x n
matrix with eigenvalues λk and eigenvectors vk,

    A vk = λk vk,    k = 1, 2, ..., n

These n equations may be written in matrix form as

    A [v1, v2, ..., vn] = [λ1 v1, λ2 v2, ..., λn vn]        (2.59)

Therefore, with

    V = [v1, v2, ..., vn]

and

    Λ = diag{λ1, λ2, ..., λn}

Eq. (2.59) becomes A V = V Λ and, since the eigenvectors are linearly independent, V is
invertible and A = V Λ V^-1.
Note that the invertibility of A guarantees that λi ≠ 0, so this sum is always well defined.
In many signal processing applications one finds that a matrix B is singular or ill con-
ditioned (one or more eigenvalues are close to zero). In these cases, we may sometimes
stabilize the problem by adding a constant to each term along the diagonal,

    A = B + αI

The effect of this operation is to leave the eigenvectors of B unchanged while changing the
eigenvalues of B from λk to λk + α. To see this, note that if vk is an eigenvector of B with
eigenvalue λk, then

    A vk = B vk + α vk = (λk + α) vk

Therefore, vk is also an eigenvector of A with eigenvalue λk + α. This result is summarized
in the following property for future reference.
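A short Python/NumPy check of this result (example matrix and constant assumed): adding αI to B shifts every eigenvalue by α while leaving the eigenvectors unchanged.

    import numpy as np

    B = np.array([[2.0, 1.0], [1.0, 2.0]])
    alpha = 0.5
    lam_B, V = np.linalg.eigh(B)                        # eigenvalues and eigenvectors of B
    lam_A, _ = np.linalg.eigh(B + alpha * np.eye(2))
    print(np.allclose(lam_A, lam_B + alpha))            # True: eigenvalues shift by alpha
    print(np.allclose((B + alpha * np.eye(2)) @ V, V @ np.diag(lam_B + alpha)))   # True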
The following example considers finding the inverse of a matrix of the form A = B + αI,
where B is a noninvertible Hermitian matrix of rank one.
6We will encounter matrices of this form in Chapter 8 when we consider the autocorrelation matrix of a random
process consisting of a complex exponential in white noise.
    A^-1 = (1/(α + λ)) u1 u1^H + Σ_{i=2}^{n} (1/α) vi vi^H        (2.62)
Since the n orthonormal eigenvectors of A also form an orthonormal set of eigenvectors for
I, then the spectral theorem applied to I gives

    I = Σ_{i=1}^{n} vi vi^H
    y = [y1, y2, ..., yn]^T = [v1^H x, v2^H x, ..., vn^H x]^T = V^H x        (2.65)